A Scalable, Distributed Web-Crawler*

Ankit Jain, Abhishek Singh, Ling Liu
Technical Report GIT-CC
College of Computing, Atlanta, Georgia

* The manuscript is still in progress.

In this paper we present the design and implementation of a scalable, distributed web-crawler. The motivation for designing such a system is to effectively distribute crawling tasks to different machines in a peer-to-peer distributed network. Such an architecture leads to scalability and helps tame the exponential growth of the crawl space of the World Wide Web. With experiments on a prototype implementation of the system, we show its scalability and efficiency.

1. Introduction

A web crawler forms an integral part of any search engine. The basic task of a crawler is to fetch pages, parse them to obtain more URLs, and then fetch those URLs to obtain even more URLs. In the process the crawler can also log the pages or perform other operations on them, according to the requirements of the search engine. Most of these auxiliary tasks are orthogonal to the design of the crawler itself.

The explosive growth of the web has rendered the simple task of crawling the web non-trivial, and with the rapid increase in the search space, crawling the web is becoming more difficult day by day. But all is not lost: newer computational models are being introduced to make resource-intensive tasks more manageable. The price of computing is decreasing monotonically, and it has become very economical to use several cheap computation units in a distributed fashion to achieve high throughput. The challenge in using such a distributed model is to distribute the computation tasks efficiently, avoiding the overheads of synchronization and consistency maintenance. Scalability is also an important requirement for such a model to be usable.

In this project, the architecture of a scalable, distributed web crawler has been proposed and implemented. It is designed to make use of cheap resources and tries to remove some of the bottlenecks of present crawlers in a novel way. For the sake of simplicity and focus, we only worked on the crawling part of the crawler, logging only the URLs; other functions can easily be integrated into the design.

Section 2 describes the salient features of our design. Section 3 gives an overview of the proposed architecture. Section 4 goes into the details of a crawler entity in our architecture. In Section 5 we explain the probabilistic hybrid search model. In Section 6 we briefly describe the implementation of our system. Section 7 discusses the experimental results and their interpretation. In the later sections we present our conclusions and describe the learning experience gained during this project.

2. Salient features of the design

Our major objectives while designing the crawler were:

- Increased resource utilization (through multithreaded programming to increase concurrency)
- Effective distribution of crawling tasks with no central bottleneck
- Easy portability
- Limiting the request load on any given web server
- Configurability of the crawling tasks

Besides catering to these capabilities, our design also includes a probabilistic hybrid search model, implemented using a probabilistic hybrid of the stack and queue ADTs (Abstract Data Types) for maintaining the pending URL lists. Details of the probabilistic hybrid model are presented in Section 5.

This distributed crawler is a peer-to-peer distributed crawler, with no central entity. By using a distributed crawling model we avoid bottlenecks in:

- Network throughput
- Processing capability
- Database capacity
- Storage capacity

A database capacity bottleneck is avoided by dividing the URL space into disjoint sets, each of which is handled by a separate crawler. Each crawler parses and logs only the URLs that lie in its URL space subset, and forwards the rest of the URLs to the corresponding crawler entities. Each crawler has prior knowledge of a lookup table relating each URL subset to the [IP:PORT] combination identifying the corresponding crawler entity.

3. Distributed crawler

The crawler system consists of a number of crawler entities, which run on distributed sites and interact in a peer-to-peer fashion. Each crawler entity knows its own URL subset, as well as the mapping from URL subsets to the network addresses of the corresponding peer crawler entities. Whenever a crawler entity encounters a URL from a different URL subset, the URL is forwarded to the appropriate peer crawler entity based on the URL-subset-to-crawler-entity lookup. Each crawler entity maintains its own database, which stores only the URLs from the URL subset assigned to that particular entity. The databases are disjoint and can be combined offline when the crawling task is complete.
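The report gives no code for this lookup; the following is a minimal Java sketch of the idea, assuming the subset id is taken from the last few bits of a hash of the domain name. The class, its names, and the use of String.hashCode are our illustrative assumptions, not the prototype's actual scheme.

    import java.net.InetSocketAddress;
    import java.util.Map;

    // Sketch of the URL-subset lookup: the last few bits of a domain-name
    // hash select a subset, and a static table maps each subset to the
    // [IP:PORT] of the crawler entity that owns it.
    class SubsetLookup {
        private final Map<Integer, InetSocketAddress> table; // subset id -> [IP:PORT]
        private final int bits;                              // low-order hash bits used

        SubsetLookup(Map<Integer, InetSocketAddress> table, int bits) {
            this.table = table;
            this.bits = bits;
        }

        // Subset id of the entity responsible for this host.
        int subsetOf(String host) {
            return host.hashCode() & ((1 << bits) - 1);
        }

        // Network address of the peer crawler entity owning this host.
        InetSocketAddress ownerOf(String host) {
            return table.get(subsetOf(host));
        }
    }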

4. Crawler Entity

Each crawler entity consists of several crawler threads, a URL handling thread, a URL packet dispatcher thread and a URL packet receiver thread. The URL set assigned to each crawler entity is further divided into subsets, one for each crawler thread.

Each crawler thread has its own pending URL list. Each thread picks up an element from its URL pending list, generates an HTTP fetch request, gets the page, parses the page to extract any URLs in it, and finally puts them in the job pending queue of the URL handling thread.

During initialization the URL handling thread reads the hash to [IP:PORT] mapping. It also has a job queue. This thread gets a URL from the job queue and checks whether the URL belongs to the URL set corresponding to the crawler entity; it does so based on the last few bits of the hash of the domain name in the URL, in conjunction with the hash to [IP:PORT] mapping. If the URL belongs to another entity, the thread puts the URL on the dispatcher queue and gets a new URL from its job queue. If the URL belongs to its own set, it first checks the URL-seen cache; if that test fails, it queries the URL database to check whether the URL has already been seen, and inserts the URL into the database if it has not. It then puts the URL into the URL pending list of one of the crawler threads.

URLs are assigned to crawler threads based on domain names. Each domain name is serviced by only one thread, and hence only one connection is maintained with any given server. This ensures that the crawler does not overload a slow server. A different hash is used when distributing jobs among the crawler threads than when determining the URL subset. The objective is to isolate the two operations so that there is no correlation between a crawler entity and the thread a URL is assigned to, thus balancing the load evenly across the threads.

The decision to divide the URL space on the basis of domain names was based on the observation that many pages on the web tend to link to pages in the same domain. Since all URLs with a particular domain name lie in the same URL subset, these URLs need not be forwarded to other crawler entities. This scheme therefore provides an effective strategy for dividing the crawl task between the peer-to-peer nodes of this distributed system. We validate this argument in the experiments described in Section 7.

The URL dispatcher thread communicates URLs to their corresponding crawler entities. The URL receiver thread collects the URLs received from other crawler entities (i.e., communicated via the dispatcher threads of those entities) and puts them on the job queue of the URL handling thread.
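The routing decision of the URL handling thread can be summarized in code. The sketch below reuses the SubsetLookup sketch above; the UrlStore interface, the second hash, and the queue types are illustrative stand-ins for the prototype's LRUCache, URL database, and JobQueue classes, whose exact interfaces the report does not give.

    import java.net.URL;
    import java.util.List;
    import java.util.Queue;

    // Illustrative stand-in for both the URL-seen cache and the URL database.
    interface UrlStore { boolean contains(URL u); void add(URL u); }

    class UrlHandler {
        private final SubsetLookup lookup;           // hash -> [IP:PORT] mapping, read at init
        private final int localSubset;               // subset id of this crawler entity
        private final UrlStore seenCache;            // URL-seen cache (approximate LRU)
        private final UrlStore database;             // URL database (MySQL in the prototype)
        private final Queue<URL> dispatcherQueue;    // URLs bound for peer entities
        private final List<Queue<URL>> threadQueues; // pending lists of the crawler threads

        UrlHandler(SubsetLookup lookup, int localSubset, UrlStore seenCache, UrlStore database,
                   Queue<URL> dispatcherQueue, List<Queue<URL>> threadQueues) {
            this.lookup = lookup; this.localSubset = localSubset;
            this.seenCache = seenCache; this.database = database;
            this.dispatcherQueue = dispatcherQueue; this.threadQueues = threadQueues;
        }

        void handle(URL url) {
            String host = url.getHost();
            if (lookup.subsetOf(host) != localSubset) {
                dispatcherQueue.add(url);            // belongs to a peer entity
                return;
            }
            if (seenCache.contains(url)) return;     // fast path: already seen
            if (database.contains(url)) { seenCache.add(url); return; }
            database.add(url);                       // log the newly seen URL
            seenCache.add(url);
            // A different hash assigns the URL to a crawler thread, so thread
            // assignment is uncorrelated with subset assignment; one thread per
            // domain keeps at most one connection open to any given server.
            int t = Math.floorMod(host.hashCode() * 0x9E3779B9, threadQueues.size());
            threadQueues.get(t).add(url);
        }
    }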
5. Probabilistic Search Model

We use a search model that can be configured to behave as DFS, as BFS, or as a hybrid of both: it behaves as DFS a given percentage of the time and as BFS the rest of the time. We use a probabilistic hybrid of the stack and queue abstract data types to store the list of pending URLs. DFS can be modeled by using a stack for the URL pending list: a stack maintains last-in-first-out order by pushing and popping elements at the same end of the list. Similarly, BFS can be modeled by using a queue, which maintains first-in-first-out order by pushing elements at one end of the list and popping them from the other.

In short, if we pop and push elements at the same end we get DFS, and if we pop and push at different ends we get BFS. We use this fact to obtain a hybrid of DFS and BFS: we push elements at one end of the list, and pop elements from the same end with probability p and from the other end with probability 1 - p. If p = 1 the system behaves as pure DFS; if p = 0 it behaves as pure BFS. For p anywhere between 0 and 1, the system behaves as DFS p*100% of the time and as BFS the rest of the time. Each time we need to pop an element, we decide with probability p whether to take the element from the top of the list or from the bottom.
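A minimal Java sketch of this hybrid list, assuming a double-ended queue underneath (the class and method names are ours; the prototype exposes this behavior as an option of its JobQueue class):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.Random;

    // Probabilistic hybrid of stack and queue. Elements are always pushed at
    // the head; with probability p a pop also comes from the head (DFS step),
    // otherwise from the tail (BFS step).
    class HybridList<E> {
        private final Deque<E> deque = new ArrayDeque<>();
        private final Random random = new Random();
        private final double p; // p = 1 gives pure DFS, p = 0 gives pure BFS

        HybridList(double p) { this.p = p; }

        synchronized void push(E e) { deque.addFirst(e); }

        synchronized E pop() {
            if (deque.isEmpty()) return null;
            return (random.nextDouble() < p) ? deque.pollFirst() : deque.pollLast();
        }
    }

With p = 0.5, for example, roughly half of the pops step depth-first and half breadth-first, with no extra book-keeping beyond the single list.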

By varying the value of p the search characteristics change, which affects the cache-hit ratio and the coverage of the search. We intended to find the optimum value of p, i.e. the one that yields the highest hit rate for both the DNS cache and the URL-seen cache; such a study could lead to a significant improvement in crawler performance. We implemented this hybrid structure, but due to time constraints we could not perform this study within this project.

6. The Implementation

The system was implemented on the Java platform for portability reasons. MySQL was used for the URL database. Even though Java is less efficient than languages that compile to native machine code, and none of the team members were proficient with it, we selected Java for this prototype. The reasons behind this decision were to keep the software architecture modular, to make the system portable, and to manage the complexity of such a system. In retrospect this turned out to be a good decision, as we might not have been able to complete the project in time had we implemented it in another language such as C. The comprehensive libraries provided with Java allowed us to concentrate our efforts on the design of the system and its software architecture.

A Java class was written for each of the various components of the system (i.e., the different kinds of threads, the database, synchronized job queues, caches, etc.). First we wrote generic classes for the infrastructure components of the system, such as the synchronized job queues and the caches. The LRUCache class implements an approximate LRU cache based on a hash table with overlapping buckets. The JobQueue class implements a generic synchronized job queue with an option for the probabilistic hybrid of the stack and queue ADTs.

The main Crawler class performs the initialization by reading the configuration files, spawning the various threads accordingly, and initializing the various job queues; it then behaves as the Handler Thread. A class named CrawlerThread performs the operation of the Crawler Thread. This thread simply gets a URL from its job queue and messages the URLlist class with this URL. The URLlist class then spawns a new thread that fetches the page, parses it for URL links, and returns the list of these URLs back to the CrawlerThread. In Java the URL fetch operation is not guaranteed to return, and in the case of a malicious web server the whole thread could hang waiting for the operation to complete. This is why the URLlist class spawns a new thread for each fetch: the thread is given a certain time-out, and if the URL fetch operation is not completed in time, the thread stops after the time-out and normal operation resumes. Spawning a new thread to fetch each page adds some overhead, but it is essential for the robustness of the system.

The Sender and Receiver classes implement the Sender and Receiver threads respectively. The Receiver class opens a UDP socket at a pre-determined port and waits for packets. The Sender class transmits URLs via UDP packets to the appropriate remote node. Besides the classes that form the system architecture described above, we added a Probe Thread and a Measurement class to the system. The relevant classes report the appropriate measurements to the Measurement class, and the Probe Thread messages the Measurement class to output the measurements at configurable periodic time intervals.
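The per-fetch time-out mechanism can be sketched as follows. This is a minimal, hedged reconstruction of what the report describes, not the prototype's URLlist code; parseLinks is a placeholder for its HTML link extractor.

    import java.io.InputStream;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicReference;

    // The fetch runs in its own thread, since a Java URL fetch is not
    // guaranteed to return; the caller waits at most timeoutMillis and then
    // resumes normal operation, abandoning a hung fetch.
    class TimedFetcher {
        static List<URL> fetch(URL url, long timeoutMillis) throws InterruptedException {
            AtomicReference<List<URL>> result = new AtomicReference<>();
            Thread worker = new Thread(() -> {
                try (InputStream in = url.openStream()) {
                    result.set(parseLinks(in, url));   // extract URLs from the page
                } catch (Exception e) {
                    // malformed page or dead server: treat as no links found
                }
            });
            worker.setDaemon(true);   // do not keep the JVM alive for a hung fetch
            worker.start();
            worker.join(timeoutMillis);                // wait, but only this long
            List<URL> links = result.get();
            return links != null ? links : new ArrayList<>(); // timed out: give up
        }

        private static List<URL> parseLinks(InputStream in, URL base) {
            // Placeholder: the real implementation parses the HTML for links.
            return new ArrayList<>();
        }
    }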
7. Evaluation and Results

We performed experiments to evaluate the performance and scalability of the system. Our experimental setup consisted of four Sun Ultra-30 machines. One crawler entity ran on each of these machines, and each entity was configured with 12 Crawler Threads.

During the design we decided to store all the queues in memory, as the cost of memory is low and several cheap computer systems come equipped with 2 GB of RAM. Our program should never require more memory than this: at roughly a hundred bytes per URL, 2 GB of RAM can accommodate about 20 million URLs in the various queues of each entity, and we do not expect the queue size of any particular node to exceed 20 million when the URL space is divided across several nodes. Unfortunately, we could not arrange such machines for our experiments; instead we ran our experiments on machines with only 128 MB of RAM, with even less memory available to our process.

In our experiment [Figure 2] we consequently faced problems due to the unavailability of the required memory space: the nodes failed after memory overflow. Arrows in the graph depict node failures. The first of the systems went down after about 12 minutes due to memory overflow; by this time the system had crawled about pages, giving a throughput of documents per second. The second node went down after about twenty minutes, at which point the throughput was pages per second. The third and fourth nodes went down at about 57 minutes, with a throughput of about 31.4 pages per second. This result, although not straightforward to interpret due to the failure of the nodes, is still very promising: at about 74 documents per second (roughly 6.4 million pages per day), one billion pages can be crawled in less than six months. Of course, tests on machines with the required amount of memory need to be performed to corroborate this throughput.

In Figure 3 we show the queue sizes and the number of pages crawled for one of the four systems in the above experiment. As seen in the graph, the number of pages crawled grows fairly linearly, indicating an almost constant throughput throughout the run. The graph also justifies our decision to keep one handler thread per crawler entity: except for a few temporary bursts, the handler thread queue length stays fairly low, so it can be inferred that one handler thread is enough to execute its functions quickly even for multiple crawler threads. The worker queue length identifies the culprit for the memory requirement: it increases at a rate much higher than the rate at which pages are crawled.

To study the scalability of the system we compute the scalability factor of the system with 4 nodes:

Scalability Factor = Throughput with 4 nodes working together / Throughput with 4 nodes working independently

We calculated the scalability factor after the first ten minutes of execution of the crawler and found its value to be 97.9%. This shows extremely good scalability, as the system incurs only about 2% overhead for distributing the task; i.e., the distribution of the task was fairly effective.

We also measured the number of URLs that needed to be forwarded to other peer nodes. For this purpose we introduce the distribution factor:

Distribution Factor = Number of local pages found / Number of pages found

Here, local pages are pages that belong to the same URL subset as their parents; these local pages are not forwarded to other peer nodes and do not generate network traffic. Needless to say, the higher the distribution factor the better, as a high value indicates an effective distribution of the crawl space. If the web were a random hyperlink structure, the expected value of the distribution factor would be 25% in our case of 4 nodes. In our experiments we found the distribution factor to be 65% (averaged over the more than 100,000 pages the system crawled). This again validates our claim that dividing the URLs into subsets based on domain names, and then assigning a URL subset to each node, is an effective distribution of the crawl task.
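As a worked example of these two metrics, the sketch below plugs in hypothetical throughput numbers chosen to reproduce the reported 97.9% scalability factor and 65% distribution factor; the report does not give the underlying per-node measurements.

    // Worked example of the two metrics defined above. All input numbers are
    // hypothetical; only the resulting ratios match the figures reported here.
    class Metrics {
        public static void main(String[] args) {
            // Scalability factor = combined throughput / independent throughput.
            double together = 40.0;  // hypothetical pages/s, 4 nodes cooperating
            double alone = 40.86;    // hypothetical pages/s, sum of 4 independent runs
            System.out.println("scalability factor = " + together / alone); // ~0.979

            // Distribution factor = local pages found / all pages found.
            long local = 65_000, total = 100_000;  // hypothetical counts
            System.out.println("distribution factor = " + (double) local / total); // 0.65

            // Baseline: in a random hyperlink structure a link is local with
            // probability 1/n, i.e. 0.25 for n = 4 nodes.
            int n = 4;
            System.out.println("random baseline = " + 1.0 / n);
        }
    }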

Our next experiment aims to explore the level of concurrency attainable within a single crawler entity. In this experiment we use only one node and measure the performance of the system at the end of 5 minutes while varying the number of crawler threads. The resulting graph [Figure 4], although not very smooth, provides a clear indication of increased throughput as the number of crawler threads increases, validating our claim of increased resource utilization. Beyond about 48 threads the throughput starts to decrease because of the synchronization overheads of the system. The graph suggests that around 32 to 48 crawler threads per crawler entity may provide optimum performance. In these experiments a single node achieved a throughput of 32 documents per second, again a very promising figure in terms of system performance.

8. Contributions of this project

The biggest contribution of this project is the concept of distributing crawl tasks based on disjoint subsets of the URL crawl space. We also presented a scalable, multi-threaded, peer-to-peer distributed architecture for a web crawler based on this concept. Another interesting contribution of the project is the proposed probabilistic hybrid of Depth-First Traversal and Breadth-First Traversal, although we were unable to study its advantages or disadvantages during this project. This traversal strategy achieves a hybrid of the two traditional strategies without any extra book-keeping and is very easy to implement. We also implemented a complete web crawler that demonstrates all of the above concepts.

9. The Learning experience

The foundations of this project were laid in the class discussions on web crawlers and the challenges that lie in their design. Since the web is growing exponentially, the proposed solution had to be scalable, and capable of making good use of a cluster of computers rather than depending on one large-capacity machine. Discussions of the DFS and BFS navigation strategies for efficient crawls of the web prompted us to experiment with a probabilistic navigation strategy. The papers referred to in class, especially [2], also gave us insights into the design and implementation of such a system.

The design and implementation of this project was a very fruitful hands-on experience, and it turned out to be a very good design exercise in which we had to deal with real-world systems issues. The project was initiated by a task-distribution idea, but to demonstrate the usefulness of that concept we had to design a whole system that exploited the idea in its architecture. While designing the architecture we were faced with the challenges, decisions, and trade-offs associated with Internet applications. Thus this project covered designing an Internet application from scratch: from design principles to system architecture, and then the implementation and evaluation of the system.

We implemented the whole project on our own, using only the Java libraries, which was in itself very useful. Due to the nature of Internet applications such as this one, it is important to emphasize efficient implementation as well as portability of the system. The system also included components from various domains: we implemented a multi-threaded architecture, synchronized job queues, LRU caches, crawlers, other networking components, database query components, etc. Even though we had studied these components before, in this project we implemented all of them, which gave us insight into their implementation issues. Beyond implementing the individual components, we also gained experience in integrating them to make the whole system work.
During the evaluation we designed and executed experiments to validate our claims; this also gave us insight into the proper interpretation of experimental results and the logical derivation of conclusions. Throughout this project we experienced the fact mentioned in class: developing a web crawler is easy, but developing an efficient web crawler is very difficult.

10. Future extensions

Future extensions of the project include implementing the DNS cache in the Crawler Thread and studying the effect of the hybrid traversal strategy on the various cache-hit rates. A number of issues remain to be dealt with to make this system usable in the real world. The crawler needs to conform to the robots exclusion protocol. We need to handle partial failure: although at present the failure of one node does not stop the other components, it would be desirable for the other nodes to take over the task of the node that failed. Dynamic reconfiguration and dynamic load-balancing would also be desirable.

11. Related work

The Google [1] web crawler is written in Python; it is single-threaded and uses asynchronous I/O to fetch data over several concurrent connections. The crawler transmits downloaded pages to a single Store Server process, which compresses the data and stores it in a repository. Another well-known web crawler is Mercator [2], a multithreaded web crawler written in Java. Although Mercator is not distributed, it divides the URL space as our design does, to guarantee that only one thread will contact a given server. We do not deal with storing web pages or with the indexing process in this project. Our architecture is distributed as well as multithreaded; in this way we increase concurrency within a single machine as well as across the entire system of several computers. We also have a distributed database with no central bottleneck, and we make use of a probabilistic search model for crawling web pages. This combination of features, improved resource utilization, and scalability distinguishes our work from related previous work.

12. Conclusion

All in all, the performance results of the crawler are very promising. We achieved a throughput rate of 75 documents per second. This is an encouraging result, as even at 31.7 pages per second one billion documents can be crawled in one year. We have also validated our claims of scalability and improved resource utilization with the experimental results. Although the results are encouraging, more tests need to be conducted to find out whether such a system can really be useful in real-world situations.

References

[1] Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proceedings of the 7th International World Wide Web Conference, April 1998.
[2] Allan Heydon and Marc Najork. Mercator: A Scalable, Extensible Web Crawler. Compaq Systems Research Center.
[3] Class notes.
