Distributed Web Crawling over DHTs. Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy CS294-4


Search Today: Crawl → Index → Search

What's Wrong?
- Users have a limited search interface.
- Today's web is dynamic and growing: timely re-crawls are required, but not feasible for all web sites.
- Search engines control your search results:
  - They decide which sites get crawled: 550 billion documents were estimated in 2001 (BrightPlanet), while Google indexes 3.3 billion documents.
  - They decide which sites get updated more frequently.
  - They may censor or skew result rankings.
- Challenge: user-customizable searches that scale.

Our Solution: A Distributed Crawler
- P2P users donate excess bandwidth and computation resources to crawl the web.
- Organized using distributed hash tables (DHTs).
- A DHT- and query-processor-agnostic crawler:
  - Designed to work over any DHT.
  - Crawls can be expressed as declarative recursive queries, making user customization easy.
  - Queries can be executed over PIER, a DHT-based relational P2P query processor.
- Crawlees: web servers. Crawlers: PIER nodes.

Potential infrastructure for crawl personalization:
- User-defined focused crawlers.
- Collaborative crawling/filtering (special interest groups).

Other possibilities:
- A bigger, better, faster web crawler; enables new search and indexing technologies: P2P web search, web archival and storage (with OceanStore).
- A generalized crawler for querying distributed graph structures: monitoring file-sharing networks (e.g. Gnutella); P2P network maintenance (routing information, OceanStore meta-data).

Challenges that We Investigated
- Scalability and throughput: DHT communication overheads.
- Balancing network load on crawlers; two components of network load: download bandwidth and DHT bandwidth.
- Network proximity: exploit the network locality of crawlers.
- Limiting download rates on web sites: prevents denial-of-service on crawlees.

Main tradeoff: tension between coordination and communication. We can balance load either on the crawlers or on the crawlees, and exploiting network proximity comes at the cost of extra communication.
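The per-site rate limit mentioned above can be sketched as a minimum delay between successive downloads from the same host. This is a minimal illustration, not the PIER crawler's actual implementation; the class name and the one-second interval are assumptions:

```python
import time

class RateThrottle:
    """Enforce a minimum interval between downloads from the same host,
    so a swarm of crawlers cannot mount a denial of service on a crawlee."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self.last_fetch = {}  # host -> timestamp of the last permitted download

    def ready(self, host, now=None):
        """Return True (and record the fetch) if `host` may be fetched now."""
        now = time.monotonic() if now is None else now
        last = self.last_fetch.get(host)
        if last is None or now - last >= self.min_interval:
            self.last_fetch[host] = now
            return True
        return False
```

URLs whose host is not yet ready would be reordered to the back of the crawl queue rather than dropped.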

Crawl as a Recursive Query

Publish WebPage(url) and Link(sourceurl, desturl); the projection Π Link.destUrl feeds back into WebPage(url), closing the recursion.

[Query-plan figure: seed URLs and a DHT scan of WebPage(url) feed input URLs through rate-throttle/reorder filters and duplicate elimination into the CrawlWrapper (a crawler thread with downloader, redirect handler, and link extractor), whose output links are published back into the plan.]
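The recursive query above can be read as a fixpoint computation. A minimal sketch, assuming a stand-in `extract_links` function in place of the real download-and-extract CrawlWrapper:

```python
def crawl_fixpoint(seeds, extract_links):
    """Evaluate the recursive crawl query to fixpoint (semi-naive style):
        WebPage(url)  :- SeedUrl(url).
        Link(src, dst):- WebPage(src).        # CrawlWrapper fetches src
        WebPage(url)  :- Link(_, url).        # projection on Link.destUrl
    Duplicate elimination ensures each URL is crawled at most once."""
    webpages = set(seeds)   # the WebPage relation
    frontier = set(seeds)   # newly derived URLs (the delta)
    links = set()           # the Link relation
    while frontier:
        new = set()
        for src in frontier:
            for dst in extract_links(src):   # CrawlWrapper: download + extract
                links.add((src, dst))
                if dst not in webpages:      # DupElim
                    webpages.add(dst)
                    new.add(dst)
        frontier = new
    return webpages, links
```

In the actual system this loop is not run on one node: the derived WebPage tuples are published into the DHT, which distributes them to the responsible crawlers.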

Crawl Distribution Strategies

Partition by URL:
- Ensures an even distribution of crawler workload.
- High DHT communication traffic.

Partition by hostname:
- One crawler per hostname.
- Creates a control point for per-server rate throttling.
- May lead to uneven crawler load distribution.
- Single point of failure: a bad choice of crawler hurts per-site crawl throughput.
- Slight variation: X crawlers per hostname.
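The two schemes differ only in what gets hashed into the DHT key space before a URL is assigned to a crawler. A minimal sketch; the simple modulo placement stands in for the DHT's consistent hashing, and the node names are placeholders:

```python
import hashlib
from urllib.parse import urlparse

def dht_key(s: str) -> int:
    """Map a string to a 160-bit identifier (SHA-1, as in Chord-style DHTs)."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

def assign_by_url(url: str, nodes: list) -> str:
    """Partition by URL: each URL hashes independently, spreading load evenly
    but routing almost every discovered link through the DHT."""
    return nodes[dht_key(url) % len(nodes)]

def assign_by_hostname(url: str, nodes: list) -> str:
    """Partition by hostname: every URL of one site lands on one crawler,
    which can then enforce per-server rate limits locally."""
    host = urlparse(url).hostname
    return nodes[dht_key(host) % len(nodes)]
```

Because `assign_by_hostname` keys on the host, self-links within a site never leave the responsible node, which is why the hostname scheme has the lowest DHT traffic.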

Redirection

A simple technique that allows a crawler to redirect, i.e. pass on, its assigned work to another crawler (and so on): a second-chance distribution mechanism orthogonal to the partitioning scheme.

Example: partition by hostname, where the node responsible for www.google.com dispatches work (by URL) to other nodes. This combines the load-balancing benefits of partitioning by URL with the control benefits of partitioning by hostname.

When to redirect? Policy-based: crawler load (queue size), network proximity.
Why not redirect? The cost of redirection: increased DHT control traffic. Hence, we limit the number of redirections per URL.
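A queue-size redirection policy with a per-URL redirect budget can be sketched as follows. The threshold, the budget, and the random peer choice are illustrative assumptions, not the paper's exact policy:

```python
import random

MAX_REDIRECTS = 1      # one-level redirection, as in the experiments
QUEUE_THRESHOLD = 100  # illustrative overload threshold

def route(url, redirects, my_queue_len, peers, rng=random):
    """Decide whether to crawl a URL locally or hand it to another crawler.

    Returns ('crawl', None) to process locally, or ('redirect', peer).
    Bounding `redirects` keeps a URL from bouncing around the DHT and
    caps the extra control traffic that redirection introduces."""
    if my_queue_len > QUEUE_THRESHOLD and redirects < MAX_REDIRECTS and peers:
        return ('redirect', rng.choice(peers))
    return ('crawl', None)
```

A proximity-based policy would replace `rng.choice` with a choice biased toward peers with low measured RTT to the target site.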

Experiments

Deployment:
- WebCrawler over PIER and the Bamboo DHT, on up to 80 PlanetLab nodes.
- 3 crawl threads per crawler; 15-minute crawl duration.

Distribution (partition) schemes:
- URL
- Hostname
- Hostname with 8 crawlers per unique host
- Hostname with one level of redirection on overload

Crawl workloads:
- Exhaustive crawl: seed URL http://www.google.com; 78,244 different web servers.
- Crawl of a fixed number of sites: seed URL http://www.google.com; 45 web servers within google.
- Crawl of a single site within http://groups.google.com.

Crawl of Multiple Sites I

CDF of per-crawler downloads (80 nodes): partition by hostname shows severe load imbalance (70% of crawlers idle); schemes that keep more crawlers busy fare better.

Crawl throughput scale-up: hostname can exploit at most 45 crawlers (one per site). Redirect (the hybrid hostname/URL scheme) does best.

Crawl of Multiple Sites II

Per-URL DHT overheads:
- Redirect: per-URL DHT overheads hit their maximum around 70 nodes; redirection incurs higher overheads only after queue size exceeds a threshold.
- Hostname incurs low overheads, since the crawl only looks at google.com, which has many self-links.

Network Proximity

We sampled 5,100 crawl targets and measured ping times from each of 80 PlanetLab hosts.
- Partition by hostname approximates random assignment.
- Best-of-3 random is close enough to best-of-5 random.
- Sanity check: what if a single host crawls all targets?
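"Best-of-3 random" assigns a target to the closest of three randomly sampled crawlers, by measured RTT. A minimal sketch; the RTT table is assumed to be given (e.g. from the PlanetLab ping measurements):

```python
import random

def best_of_k(target, crawlers, rtt, k=3, rng=random):
    """Sample k crawlers at random and pick the one with the lowest
    measured round-trip time to the target. Approximates full
    proximity-aware assignment without global coordination."""
    sample = rng.sample(crawlers, min(k, len(crawlers)))
    return min(sample, key=lambda c: rtt[(c, target)])
```

Increasing k improves proximity but costs more probing; the measurements above suggest k = 3 already captures most of the benefit of k = 5.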

Summary of Schemes

Scheme   | Load-balance download bandwidth | Load-balance DHT bandwidth | Rate-limit crawlees | Network proximity | DHT communication overheads
URL      | +                               | +                          | -                   | -                 | -
Hostname | -                               | -                          | +                   | ?                 | +
Redirect | +                               | ?                          | +                   | +                 | --

Related Work

Herodotus (MIT, Chord-based):
- Partitions by URL.
- Batching with ring-based forwarding.
- Experiments on 4 local machines.

Apoidea (Georgia Tech, Chord-based):
- Partitions by hostname.
- Forwards crawl work to the DHT neighbor closest to the website.
- Experiments on 12 local machines.

Conclusion

Our main contributions:
- A DHT- and query-processor-agnostic distributed crawler.
- Expressing a crawl as a query, permitting user-customizable refinement of crawls.
- Discovery of an important trade-off in distributed crawling: coordination comes with extra communication costs.
- Deployment and experimentation on PlanetLab: examining crawl distribution strategies under different workloads on live web sources, and measuring the potential benefits of network proximity.

Backup slides

Existing Crawlers

- Cluster-based crawlers. Google: a centralized dispatcher sends URLs to be crawled; hash-based parallel crawlers.
- Focused crawlers. BINGO!: crawls the web given a basic training set.
- Peer-to-peer. Grub: SETI@Home infrastructure; 23,993 members.

Exhaustive Crawl

Partition by hostname shows imbalance: some crawlers are over-utilized for downloads. There is little difference in throughput, since most crawler threads are kept busy.

Single Site

URL performs best, followed by redirect and hostname.

Future Work

- Fault tolerance.
- Security.
- Single-node throughput.
- Work-sharing between crawl queries: essential for overlapping users.
- Global crawl prioritization: a requirement for personalized crawls; online relevance feedback.
- Deep-web retrieval.