
Worse Is Better
Vijay Gill / Bikash Koley / Paul Schultz for Google Network Architecture, Google
June 14, 2010, NANOG 49, San Francisco

Worse is Better
"You see, everybody else is too afraid of looking stupid because they just can't keep enough facts in their head at once to make multiple inheritance, or multithreading, or any of that stuff work. So they sheepishly go along with whatever faddish programming network craziness has come down from the architecture astronauts who speak at conferences and write books and articles and are so much smarter than us that they don't realize that the stuff that they're promoting is too hard for us."

Guiding Principle
- Important not to try to be all things to all people
- Clients might be demanding 8 different things
  - Doing 6 of them is easy
  - Handling 7 of them requires real thought
  - Dealing with all 8 usually results in a worse system: more complex, compromises other clients in trying to satisfy everyone
- Don't build infrastructure just for its own sake
  - Identify common needs and address them
  - Don't imagine unlikely potential needs that aren't really there
- Best approach: use your own infrastructure (especially at first!)
  - Much more rapid feedback about what works, what doesn't

Google's Mission
To organize the world's information and make it UNIVERSALLY accessible and useful.

Scale, Scale, Scale
User base
- World population: 6.676 billion people (June '08, US Census est.)
- Internet users: 1.463 billion, >20% (June '08, Nielsen/ITU)
- Google Search: more than a billion searches daily
Geographical distribution
- Google services are worldwide: over 55 countries and 112 languages
- More than half our searches come from outside the U.S.
Data growth
- Web expands/changes: billions of new/modified pages every month
- Every few hours we crawl/refresh more than the whole Library of Congress
- YouTube gains over 24 hours of video every minute (up from 13, 15, and 18), 1+ billion views a day
Latency challenge
- Speed of light in glass: 2 x 10^8 m/s = 2,000 km / 10 ms
- Blink-of-an-eye response = 100 ms
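The speed-of-light figure above sets a hard floor on response time. A minimal back-of-envelope sketch (not from the deck), assuming the 2 x 10^8 m/s fiber propagation speed quoted on the slide; the distances are illustrative:

```python
# Back-of-envelope propagation delay, using the slide's ~2e8 m/s speed of
# light in glass. The distances below are illustrative, not from the deck.
C_FIBER_M_PER_S = 2e8

def one_way_ms(distance_km: float) -> float:
    """One-way fiber propagation delay in milliseconds."""
    return distance_km * 1_000 / C_FIBER_M_PER_S * 1_000

for km in (2_000, 5_000, 10_000):
    print(f"{km:6,} km: {one_way_ms(km):5.1f} ms one-way, {2 * one_way_ms(km):6.1f} ms round-trip")
# 2,000 km -> 10 ms one-way, matching the slide; a 10,000 km path burns the
# entire 100 ms "blink of an eye" budget on round-trip propagation alone.
```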

Warehouse Scale Computers
Consolidated computing: many UIs, many apps, many locations
Luiz André Barroso and Urs Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, http://www.morganclaypool.com/doi/abs/10.2200/s00193ed1v01y200905cac006

Warehouse-size Computer vs. Warehouse-scale Computer

The Machinery
- Servers: CPUs, DRAM, disks
- Racks: 40-80 servers, Ethernet switch
- Clusters

Example (small!) storage hierarchy
- One server: DRAM 16GB, 100ns, 20GB/s; Disk 2TB, 10ms, 200MB/s
- Local rack (80 servers): DRAM 1TB, 300us, 100MB/s; Disk 160TB, 11ms, 100MB/s
- Cluster (30 racks): DRAM 30TB, 500us, 10MB/s; Disk 4.80PB, 12ms, 10MB/s
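To make the hierarchy concrete, here is a small sketch (mine, not the deck's) that turns the quoted latency and bandwidth figures into an estimated time to fetch 1 MB from each tier, using the simplification time = latency + size/bandwidth:

```python
# Estimated time to read 1 MB from each tier of the hierarchy above.
# Latency/bandwidth values are the ones quoted on the slide; the
# time = latency + size/bandwidth model is a simplification.
TIERS = {  # name: (latency in seconds, bandwidth in bytes/second)
    "local DRAM":   (100e-9, 20e9),
    "local disk":   (10e-3, 200e6),
    "rack DRAM":    (300e-6, 100e6),
    "rack disk":    (11e-3, 100e6),
    "cluster DRAM": (500e-6, 10e6),
    "cluster disk": (12e-3, 10e6),
}

def fetch_time_ms(tier: str, size_bytes: int = 1 << 20) -> float:
    latency_s, bw_bps = TIERS[tier]
    return (latency_s + size_bytes / bw_bps) * 1e3

for name in TIERS:
    print(f"{name:12s}: {fetch_time_ms(name):7.2f} ms per MB")
# Local DRAM is ~0.05 ms per MB while cluster disk is well over 100 ms per MB,
# so data placement dominates performance at this scale.
```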

Storage Hierarchy (A different View)

Communication performance?
- Big advantage in communication performance if computation fits in one system
- Performance edge of big-iron systems deteriorates if used in very large clusters
- Execution time = 1 ms + f * (100 ns * 1/#nodes + 100 us * (1 - 1/#nodes))
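A small sketch (not from the deck) evaluating the model above for a few cluster sizes; reading f as the amount of communication per 1 ms compute step is my interpretation of the formula, and the 100 ns / 100 us constants are the slide's:

```python
# Evaluate: 1 ms of compute plus f communication events per step, each local
# (100 ns) with probability 1/#nodes and remote (100 us) otherwise.
# Treating f as a per-step communication count is an assumption on my part.
def exec_time_ms(nodes: int, f: float = 1.0) -> float:
    per_access_s = 100e-9 * (1 / nodes) + 100e-6 * (1 - 1 / nodes)
    return (1e-3 + f * per_access_s) * 1e3

for nodes in (1, 8, 64, 1024):
    print(f"{nodes:5d} nodes: {exec_time_ms(nodes):.3f} ms per step")
# 1.000 ms on one node vs ~1.100 ms on a large cluster: the communication
# penalty quickly saturates, so "big iron" loses its edge at cluster scale.
```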

Reliability & Availability
Things will crash. Deal with it!
- Take super-reliable servers (MTBF of 30 years)
- Build a machine with 10 thousand of those
- Watch one fail per day
Fault-tolerant software is inevitable
Typical yearly flakiness metrics:
- 1-5% of your disk drives will die
- Servers will crash at least twice (2-4% failure rate)
Internet is not a five-nines fabric
- ~99% availability, varying significantly with geography
- Wild dogs, sharks, dead horses, thieves, blasphemy, drunken hunters
Very strange bugs are common when you have lots of gear
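A quick arithmetic check of the "one failure per day" claim (a sketch; it assumes failures are spread evenly across the quoted 30-year MTBF):

```python
# Expected daily failures for 10,000 servers with a 30-year MTBF,
# assuming failures are spread uniformly over the MTBF.
servers = 10_000
mtbf_days = 30 * 365
print(f"{servers / mtbf_days:.2f} expected server failures per day")  # ~0.91, i.e. roughly one a day
```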

The Joys of Real Hardware
Typical first year for a new cluster:
- ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
- ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
- ~1 rack move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
- ~1 network rewiring (rolling ~5% of machines down over 2-day span)
- ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
- ~5 racks go wonky (40-80 machines see 50% packet loss)
- ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
- ~12 router reloads (takes out DNS and external VIPs for a couple of minutes)
- ~3 router failures (have to immediately pull traffic for an hour)
- ~dozens of minor 30-second blips for DNS
- ~1000 individual machine failures
- ~thousands of hard drive failures
- slow disks, bad memory, misconfigured machines, flaky machines, etc.

Hard drive failures
(48,000 disks) x (0.04 fail/y) x (6 h) / (8,766 h/y) = 1.3 disk failures per run
Must design assuming drives will fail!
Graph source: Pinheiro, Weber, Barroso, FAST '07

Internet Landscape is Changing
Source: 2009 Internet Observatory Report, http://www.nanog.org/meetings/nanog47/presentations/monday/labovitz_observereport_n47_mon.pdf

Layer Cake
- Service Layer: massively scalable, highly dynamic; services drive the network
- Application Layer: control pushed down into network
- IP Layer: standardized, resilient and universal compute interconnect and service delivery
- MPLS Layer: forwarding, TE, fast restoral
- Optical Layer: cost-effective, simple, high-bandwidth, point-to-point connectivity

Topology

Traffic Classes
Two types of traffic:
1. User ("I"): predominantly user-initiated
2. Machine-to-machine ("M"): bulk, predominantly machine-to-machine
Different SLAs and requirements; different scale of traffic.

Converged vs Overlay?
1. Converged
   - Layer-1 (common transport infrastructure)
   - Layer-3 (common routing and transport infrastructure)
2. Overlay
   - Same fiber path, different transport with appropriate optimization for each layer-3

Converged L-1 Network

Converged L-1+L-3

Converged L3 Explained
[Diagrams: separate L3 networks for I-class and M-class traffic vs. a converged L3 network for both, across DC_A - X - Y - DC_B]
Better utilization of transport pipes through stat-muxing of I-class and M-class traffic.
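A toy illustration of the stat-muxing argument (my own sketch, not from the deck): two separately provisioned pipes must each be sized for their class's own peak, while a converged pipe only needs to cover the peak of the sum. The synthetic traffic traces below are invented for illustration; only the qualitative effect matters.

```python
# Toy stat-muxing comparison with synthetic, independent I and M demands (Tbps).
import random

random.seed(49)  # arbitrary seed for reproducibility
samples = 1_000
i_traffic = [max(0.0, random.gauss(0.5, 0.15)) for _ in range(samples)]  # user ("I")
m_traffic = [max(0.0, random.gauss(1.0, 0.30)) for _ in range(samples)]  # machine ("M")

separate = max(i_traffic) + max(m_traffic)                    # two peak-sized overlay pipes
converged = max(i + m for i, m in zip(i_traffic, m_traffic))  # one pipe sized for peak of sum

print(f"two overlay pipes, each peak-sized:  {separate:.2f} Tbps")
print(f"one converged pipe, peak of the sum: {converged:.2f} Tbps")
print(f"stat-mux saving:                     {100 * (1 - converged / separate):.0f}%")
# Because the two classes rarely peak at the same instant, the converged pipe
# can be provisioned smaller than the sum of the individual peaks.
```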

Two LSR Junction Options
1. LSR with ITU integration
2. LSR without ITU integration
[Diagram: TRANSPORT - E-O-O-E - LSR - E-O-O-E - TRANSPORT]

Integrated LSR: 2 types
- Loose integration: LSR + standalone DWDM equipment; cheap client optics integration
- Tight integration: LSR with integrated DWDM; cost of buffering, sync and clocking eliminated

Overlay L-1/L-3 Networks

I-scale LSR and M-scale Overlay

Assumptions/Parameters Used
- This analysis takes only CapEx into account
- Projected internet-scale traffic based on current traffic patterns is used
- Data-center-pair machine-scale demand is a parameter in this analysis
- Availability of 100Gbps transport technology is assumed
- Availability of cost-effective LSRs with/without optical integration is also assumed

Cost model (normalized on a $ per 10Gbps basis)*:
- Internet Scale Router: 600
- Merchant Silicon Scale Router: 30
- Merchant Silicon Scale LSR: 30
- Internet Scale LSR (no DWDM): 120
- Internet Scale LSR with Transport Framing (ITU lambda not included): 180
- DWDM Terminal Interface: 400
- DWDM Regen Interface: 150
- Gray Client Interface: 10

*Pricing exercise done using average street prices for the equipment where available.
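To show how the normalized port costs above drive the comparison, here is a small sketch of my own; the way ports are combined into each hypothetical junction option is an assumption made for illustration, not the deck's actual analysis:

```python
# Per-10G port costs from the table above (normalized $).
PORT_COST = {
    "internet_scale_router": 600,
    "merchant_silicon_router": 30,
    "merchant_silicon_lsr": 30,
    "internet_scale_lsr_no_dwdm": 120,
    "internet_scale_lsr_transport_framing": 180,
    "dwdm_terminal": 400,
    "dwdm_regen": 150,
    "gray_client": 10,
}

# Hypothetical junction build-outs; the component mix is illustrative only.
OPTIONS = {
    "internet-scale router + standalone DWDM": ["internet_scale_router", "gray_client", "dwdm_terminal"],
    "merchant-silicon LSR + standalone DWDM":  ["merchant_silicon_lsr", "gray_client", "dwdm_terminal"],
    "LSR with integrated transport framing":   ["internet_scale_lsr_transport_framing"],
}

for name, parts in OPTIONS.items():
    total = sum(PORT_COST[p] for p in parts)
    print(f"{name:42s}: {total:4d} per 10G")
# Even a rough composition like this shows why integrating DWDM/transport
# framing into a cost-effective LSR changes the economics of the junction.
```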

[Cost comparison graphs, different scales for each graph (absolute cost): Machine traffic X Tbps / User traffic T Tbps, and Machine traffic 2*X Tbps / User traffic T Tbps]

Conclusions
- Google's web services generate large amounts of traffic (more internal than external)
- Traffic volumes make cost efficiency a primary metric
- Cost-effective LSRs provide the simplest way to effectively stat-mux traffic and add/drop; however, DWDM HAS to be integrated with LSR functionality
- Depending on add/drop locations and the relative scale of I and M traffic, either a flat DWDM-LSR based network or an LSR-based network with optical express is the way to go
- A simple unified end-to-end control plane at the highest network layer (IP/MPLS) is a MUST
  - No need for an additional control plane at the optical layer; other layers provide the necessary functionality

Thank You!