Performance Analysis for Crawling

Similar documents
Files/News/Software Distribution on Demand. Replicated Internet Sites. Means of Content Distribution. Computer Networks 11/9/2009

Relevant?!? Algoritmi per IR. Goal of a Search Engine. Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Web Search

Large-Scale Web Applications

Shen, Tang, Yang, and Chu

CS November 2018

CS November 2017

An Overview of Search Engine. Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia

Data Centers. Tom Anderson

Venugopal Ramasubramanian Emin Gün Sirer SIGCOMM 04

Introduction. Network Architecture Requirements of Data Centers in the Cloud Computing Era

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Engineering Goals. Scalability Availability. Transactional behavior Security EAI... CS530 S05

Lecture 8: Internet and Online Services. CS 598: Advanced Internetworking Matthew Caesar March 3, 2011

Be Fast, Cheap and in Control with SwitchKV. Xiaozhou Li

CSC2231: DNS with DHTs

Web Crawling. Introduction to Information Retrieval CS 150 Donald J. Patterson

Mul$media Networking. #9 CDN Solu$ons Semester Ganjil 2012 PTIIK Universitas Brawijaya

Distributed Systems Principles and Paradigms. Chapter 12: Distributed Web-Based Systems

THE WEB SEARCH ENGINE

CS November 2017

Content Distribution. Today. l Challenges of content delivery l Content distribution networks l CDN through an example

Remote Procedure Call. Tom Anderson

Distributed Systems. 21. Content Delivery Networks (CDN) Paul Krzyzanowski. Rutgers University. Fall 2018

CS November 2018

Crawling the Web. Web Crawling. Main Issues I. Type of crawl

Distributed Data Infrastructures, Fall 2017, Chapter 2. Jussi Kangasharju

DEPLOYMENT GUIDE Version 1.1. DNS Traffic Management using the BIG-IP Local Traffic Manager

Application-layer Protocols

Streaming Video over the Internet. Dr. Dapeng Wu University of Florida Department of Electrical and Computer Engineering

Yahoo Traffic Server -a Powerful Cloud Gatekeeper

CSE 5306 Distributed Systems

Part 12 殷亚凤. Homepage: Room 301, Building of Computer Science and Technology

What s Virtual Memory Management. Virtual Memory Management: TLB Prefetching & Page Walk. Memory Management Unit (MMU) Problems to be handled

Crawling CE-324: Modern Information Retrieval Sharif University of Technology

SJTU 2018 Fall Computer Networking. Wireless Communication

Addressed Issue. P2P What are we looking at? What is Peer-to-Peer? What can databases do for P2P? What can databases do for P2P?

Load Balancing Technology White Paper

Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li

PNUTS: Yahoo! s Hosted Data Serving Platform. Reading Review by: Alex Degtiar (adegtiar) /30/2013

Data Representation and Indexing

DATA MINING - 1DL105, 1DL111

Camdoop Exploiting In-network Aggregation for Big Data Applications Paolo Costa

«Computer Science» Requirements for applicants by Innopolis University

Sun N1: Storage Virtualization and Oracle

Indexing Large-Scale Data

Today CSCI Coda. Naming: Volumes. Coda GFS PAST. Instructor: Abhishek Chandra. Main Goals: Volume is a subtree in the naming space

CSE 530A. B+ Trees. Washington University Fall 2013

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

Introduction to MySQL Cluster: Architecture and Use

Server Selection Mechanism. Server Selection Policy. Content Distribution Network. Content Distribution Networks. Proactive content replication

Media Pack. Home About User Benefits Advertiser Benefits Advertising Packages

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

NFSv4 as the Building Block for Fault Tolerant Applications

Search Engines. Information Retrieval in Practice

Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

CSC 4900 Computer Networks: Link Layer (3)

ScaleArc for SQL Server

Internet Protocol Stack! Principles of Network Applications! Some Network Apps" (and Their Protocols)! Application-Layer Protocols! Our goals:!

Dynamic Metadata Management for Petabyte-scale File Systems

Internet Content Distribution

big picture parallel db (one data center) mix of OLTP and batch analysis lots of data, high r/w rates, 1000s of cheap boxes thus many failures

This project must be done in groups of 2 3 people. Your group must partner with one other group (or two if we have add odd number of groups).

Distributed Systems. 05r. Case study: Google Cluster Architecture. Paul Krzyzanowski. Rutgers University. Fall 2016

Uniform Resource Locators (URL)

CONTENT DISTRIBUTION. Oliver Michel University of Illinois at Urbana-Champaign. October 25th, 2011

System Specification

More on Testing and Large Scale Web Apps

LegUp: Accelerating Memcached on Cloud FPGAs

RIGHTNOW A C E

Fault-Tolerant and Scalable TCP Splice and Web Server Architecture ; CU-CS

Monitor your infrastructure with the Elastic Beats. Monica Sarbu

Scaling Web Service. Capacity planning. CS144: Web Applications

CSE 5306 Distributed Systems. Naming

Next-Generation Cloud Platform

Search Quality. Jan Pedersen 10 September 2007

CSE 124: Networked Services Lecture-16

What is the relationship between a domain name (e.g., youtube.com) and an IP address?

Networking in a Vertically Scaled World

Advanced Computer Networking. CYBR 230 Jeff Shafer University of the Pacific. Project 2

The Design and Implementation of a Next Generation Name Service for the Internet (CoDoNS) Presented By: Kamalakar Kambhatla

Lecture 18 Overview. Last Lecture. This Lecture. Next Lecture. Internet Protocol (1) Internet Protocol (2)

Got Isilon? Need IOPS? Get Avere.

Improve Web Application Performance with Zend Platform

No, the bogus packet will fail the integrity check (which uses a shared MAC key).!

Anatomy of a search engine. Design criteria of a search engine Architecture Data structures

Warehouse-Scale Computing

Running Head: HOW A SEARCH ENGINE WORKS 1. How a Search Engine Works. Sara Davis INFO Spring Erika Gutierrez.

CSE 124: Networked Services Fall 2009 Lecture-19

Lecture 16: Router Design

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

The Google File System

RAMCloud. Scalable High-Performance Storage Entirely in DRAM. by John Ousterhout et al. Stanford University. presented by Slavik Derevyanko

Diffusing Your Mobile Apps: Extending In-Network Function Virtualisation to Mobile Function Offloading

Fundamentals of Windows Server 2008 Network and Applications Infrastructure

BIG-IP DNS: Monitors Reference. Version 12.1

Transcription:

Scalable Servers and Load Balancing Kai Shen Online Applications online applications Applications accessible to online users through. Examples Online keyword search engine: Google. Web email: Gmail. News: CNN, NBC news. Web directory: Yahoo!, MSN. Scalability requirements Many simultaneous user accesses; large amount of hosted data, servers Computer systems that host these online applications 11/11/2009 CSC 257/457 - Fall 2009 1 11/11/2009 CSC 257/457 - Fall 2009 2 Servers are at the Application Layer Normally on the end hosts, involving no routers Function on transport-layer protocols TCP/UDP Step 1 Crawling Crawling get all these Web pages out there: First retrieve some root pages; Parse their content and follow hyperlinks to retrieve more pages; Depth-first search or breadth-first search? Remove duplicates. Google Yahoo! CNN 11/11/2009 CSC 257/457 - Fall 2009 3 11/11/2009 CSC 257/457 - Fall 2009 4 CSC 257/457 - Fall 2009 1

Performance Analysis for Crawling What are the resources involved? CPU processing for TCP/HTTP protocol handling and the parsing of page content writing to disk storage network bandwidth to remote web sites Assume average page size 10KB raw processing power of a single CPU 1000 requests/sec I/O to a single disk 100 seeks/sec up to 100 requests/sec network bandwidth from/to the T1 link (1.5Mbit/s) 12 requests/sec T3 link (45Mbit/s) 360 requests/sec Step 2 Indexing Indexing crawled raw web pages are not easy to search. we index them to formats that are easy to search. As part of indexing, we need to give each page an ID using a hash function. Computer: Page #123 Page #357 Networks: Page #124 Page #468 11/11/2009 CSC 257/457 - Fall 2009 5 11/11/2009 CSC 257/457 - Fall 2009 6 Step 3 Online Search Index server Partitioning and Replication Index servers (partition 1) Firewall Web server/ Query handler Local-areaarea network Local-areaarea network Index servers (partition 2) Page server Web server/ Query handlers Scalability, reliability 11/11/2009 CSC 257/457 - Fall 2009 7 Page servers 11/11/2009 CSC 257/457 - Fall 2009 8 CSC 257/457 - Fall 2009 2

Load Balancing over Servers Popular sites like Google or CNN receive tens or hundreds of millions of hits per day. A large number of replicated servers are used at these sites. Key question: how to balance client requests over these servers? Load Balancing on Servers Technique 1 - DNS Rotation IP address of CNN.com? IP address of CNN.com? DNS server 11/11/2009 CSC 257/457 - Fall 2009 9 11/11/2009 CSC 257/457 - Fall 2009 10 Discussions on DNS Rotation Advantages Require almost no change on the existing architecture Problems DNS Caching Rigid load balancing policy can t balance based on runtime load changes slow or no adjustment in response to failures Load Balancing on Servers Technique 2 Cooperative Offloading 11/11/2009 CSC 257/457 - Fall 2009 11 11/11/2009 CSC 257/457 - Fall 2009 12 CSC 257/457 - Fall 2009 3

Discussions on Cooperative Offloading Can be combined with the DNS rotation. Advantages: More flexible policy is possible Be more responsive to runtime workload and server failures (to a certain degree) Problems: Need software changes on servers Longer delay 11/11/2009 CSC 257/457 - Fall 2009 13 Cooperative Offloading with TCP Handoff [Pai et al. ASPLOS1998] What does 1.3 do? What does 1.4 do? 1.3 1.4 13 1.3 All packets in a TCP connection must offload to one server? 11/11/2009 CSC 257/457 - Fall 2009 14 Cooperative Offloading vs. TCP Handoff Software changes on the servers Delays Load Balancing on Servers Technique 3 Load Balancing 1.1 1.2 Firewall LB 1.2 1.1 128.111.1.1 11/11/2009 CSC 257/457 - Fall 2009 15 11/11/2009 CSC 257/457 - Fall 2009 16 CSC 257/457 - Fall 2009 4

More About Load Balancing Summary How deep do we look into the network protocol stack? Network layer (IP)? Transport layer (TCP/UDP)? Application layer? Load balancing policies in LB routers (Goal: transparency, plug-and-play) Simple rotation Least number of active requests Shortest response time Scalable servers partitioning replication Load balancing for servers DNS rotation cooperative offloading (w. TCP handoff) Load balancing router Changes required on the components: DNS server?? Web server?? client?? router?? 11/11/2009 CSC 257/457 - Fall 2009 17 11/11/2009 CSC 257/457 - Fall 2009 18 CSC 257/457 - Fall 2009 5