Data Centers. Tom Anderson


Transport Clarification
- RPC messages can be arbitrary size. Ex: ok to send a tree or a hash table. Can require more than one packet sent/received.
- We assume messages can be dropped, duplicated, reordered. (Strictly, it is packets that are dropped, duplicated, reordered; most RPCs are single packets.)
- TCP addresses some of these issues, but we'll assume the general case.

RPC Semantics
- At least once (NFS, DNS, lab 1b): true means executed at least once; false means maybe executed, maybe multiple times.
- At most once (lab 1c): true means executed once; false means maybe executed, but never more than once.
- Exactly once: true means executed once; never returns false.
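
A minimal sketch of the at-least-once client side in Go (the labs' language); the send function is a hypothetical stand-in for one request/response attempt with a timeout, not an API from the labs. Because a lost reply is indistinguishable from a lost request, a retransmission can make the server execute the handler a second time:

    package rpcdemo

    import (
        "errors"
        "time"
    )

    // callAtLeastOnce retransmits until some attempt returns a reply.
    // send is a stand-in for one request/response attempt with a timeout.
    // A lost *reply* also looks like a lost request, so the server-side
    // handler may run more than once -- the at-least-once guarantee.
    func callAtLeastOnce(send func(req []byte) ([]byte, error), req []byte, maxTries int) ([]byte, error) {
        for i := 0; i < maxTries; i++ {
            if reply, err := send(req); err == nil {
                return reply, nil
            }
            time.Sleep(100 * time.Millisecond) // simple fixed backoff before retrying
        }
        return nil, errors.New("rpc: no reply after retries")
    }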

When does at-least-once work?
- If there are no side effects: read-only operations (or idempotent ops).
- Example: MapReduce.
- Example: NFS readFileBlock (vs. the POSIX file read interface); writeFileBlock (but not: delete file).
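
To make the idempotence point concrete, a toy sketch (hypothetical names, not NFS's actual interface): a write to an explicit offset leaves the file in the same final state no matter how many times it is re-executed, while an append does not.

    package rpcdemo

    // Hypothetical block store to illustrate idempotence. Re-running
    // WriteFileBlock with the same arguments leaves the file in the same
    // state, so at-least-once delivery is safe. Append is not idempotent:
    // a duplicated request would add the data twice.
    type blockStore struct {
        files map[string][]byte
    }

    func newBlockStore() *blockStore {
        return &blockStore{files: make(map[string][]byte)}
    }

    // WriteFileBlock writes data at a fixed offset: duplicates are harmless.
    func (s *blockStore) WriteFileBlock(name string, offset int, data []byte) {
        f := s.files[name]
        for len(f) < offset+len(data) {
            f = append(f, 0) // extend with zeros as needed
        }
        copy(f[offset:], data)
        s.files[name] = f
    }

    // Append depends on the file's current length: duplicates change the result.
    func (s *blockStore) Append(name string, data []byte) {
        s.files[name] = append(s.files[name], data...)
    }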

At Most Once
- Client includes a unique ID (UID) with each request; uses the same UID for re-sends.
- Server RPC code detects duplicate requests and returns the previous reply instead of re-running the handler:
    if seen[uid] {
        r = old[uid]
    } else {
        r = handler()
        old[uid] = r
        seen[uid] = true
    }
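
The same table as a minimal thread-safe Go sketch; the DedupServer name, string UIDs, and the handler signature are assumptions for illustration. The unbounded reply map is exactly the problem the next slides discuss.

    package rpcdemo

    import "sync"

    // DedupServer caches the reply for each UID so a retransmitted request
    // gets the old answer instead of re-running the handler (at-most-once).
    type DedupServer struct {
        mu      sync.Mutex
        old     map[string][]byte       // uid -> cached reply
        handler func(req []byte) []byte // application-level handler (assumed)
    }

    func NewDedupServer(h func([]byte) []byte) *DedupServer {
        return &DedupServer{old: make(map[string][]byte), handler: h}
    }

    func (s *DedupServer) Dispatch(uid string, req []byte) []byte {
        s.mu.Lock()
        defer s.mu.Unlock() // lock held across handler for simplicity only
        if r, seen := s.old[uid]; seen {
            return r // duplicate: return cached reply, do not re-execute
        }
        r := s.handler(req)
        s.old[uid] = r
        return r
    }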

Some At-Most-Once Issues
- How do we ensure the UID is unique? Big random number? Combine a unique client ID (IP address?) with a sequence number?
- What if the client crashes and restarts: can it reuse the same UID?
- In the labs, nodes never restart; equivalent to: every node gets a new ID on start. Client address + seq # is unique; a seq # alone is NOT unique.

When Can Server Discard Old RPCs?
- Option 1: never?
- Option 2: unique client IDs, per-client RPC sequence numbers; client includes "seen all replies <= X" with every RPC (sketch below).
- Option 3: one client RPC at a time; arrival of seq+1 allows the server to discard all <= seq.
- Labs use Option 3.
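
A sketch of Option 2's bookkeeping, with assumed per-client state: the client piggybacks the highest sequence number for which it has seen every reply, and the server drops cached replies at or below it.

    package rpcdemo

    import "sync"

    // clientState holds cached replies for one client, keyed by RPC sequence number.
    type clientState struct {
        replies map[int64][]byte
    }

    type GCServer struct {
        mu      sync.Mutex
        clients map[string]*clientState // client ID -> its cached replies
    }

    func NewGCServer() *GCServer {
        return &GCServer{clients: make(map[string]*clientState)}
    }

    // remember caches a reply so a duplicate of (clientID, seq) can be
    // answered without re-executing the handler.
    func (s *GCServer) remember(clientID string, seq int64, reply []byte) {
        s.mu.Lock()
        defer s.mu.Unlock()
        cs, ok := s.clients[clientID]
        if !ok {
            cs = &clientState{replies: make(map[int64][]byte)}
            s.clients[clientID] = cs
        }
        cs.replies[seq] = reply
    }

    // Ack records "client has seen all replies with seq <= ackSeq", so the
    // server can discard those cached replies (Option 2 on the slide).
    func (s *GCServer) Ack(clientID string, ackSeq int64) {
        s.mu.Lock()
        defer s.mu.Unlock()
        cs, ok := s.clients[clientID]
        if !ok {
            return
        }
        for seq := range cs.replies {
            if seq <= ackSeq {
                delete(cs.replies, seq)
            }
        }
    }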

What if Server Crashes?
- If the at-most-once list of recent RPC results is stored in memory, the server will forget it and accept duplicate requests when it reboots.
- Does the server need to write the recent RPC results to disk? If replicated, does the replica also need to store recent RPC results?
- In the labs, the server gets a new address on restart, so client messages aren't delivered to a restarted server (discarded if sent to the wrong version of the server).

A Data Center

Inside a Data Center

Data center
- 10k - 100k servers: 250k - 10M cores
- 1-100 PB of DRAM
- 100 PB - 10 EB storage
- 1-10 Pbps bandwidth (>> Internet)
- 10-100 MW power (data centers account for 1-2% of global energy consumption)
- 100s of millions of dollars

Servers
- Limits driven by power consumption
- 1-4 multicore sockets, 20-24 cores/socket (150W each)
- 100s GB - 1 TB of DRAM (100-500W)
- 40 Gbps link to network

Servers in racks
- 19" wide, 1.75" tall (1U) rack units (standard defined in 1922!)
- 40-120 servers/rack
- Network switch at top of rack

Racks in rows

Rows in hot/cold pairs

Hot/cold pairs in data centers

Where is the cloud?
- Amazon, in the US:
  - Northern Virginia
  - Ohio
  - Oregon
  - Northern California
- Why those locations?

MTTF/MTTR
- Mean Time to Failure / Mean Time to Repair
- Disk failures (not reboots) per year ~ 2-4%. At data center scale (100K+ servers), that's about 2/hour. It takes 10 hours to restore a 10 TB disk.
- Server crashes: 1/month * 30 seconds to reboot => ~5 minutes of downtime per server per year.
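
The ~2/hour figure depends on how many disks the fleet has, which the slide leaves implicit; one set of assumed numbers that reproduces it:

    100,000 servers x 5 disks/server         = 500,000 disks
    500,000 disks x 3.5% failures/year       = ~17,500 disk failures/year
    17,500 failures / 8,760 hours per year   = ~2 disk failures/hour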

Typical Year in a Data Center (2008)
- ~0.5 overheating events (power down most machines in <5 mins, ~1-2 days to recover)
- ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
- ~1 rack move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
- ~1 network rewiring (rolling ~5% of machines down over a 2-day span)
- ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)

Typical Year in a Data Center (2008)
- ~5 racks go wonky (40-80 machines see 50% packet loss)
- ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
- ~12 router reloads (takes out DNS and external VIPs for a couple of minutes)
- ~3 router failures (have to immediately pull traffic for an hour)
- ~dozens of minor 30-second blips for DNS
- ~1000 individual machine failures
- ~thousands of hard drive failures
- slow disks, bad memory, misconfigured machines, flaky machines, etc.

This Week: Primary/Backup
- How do we build systems that survive a single node failure?
- State replication: run two copies of the server (primary, backup) on different racks, different PDUs, different DCs!
- If the primary fails, the backup takes over.
- How do we know the primary failed vs. is just slow?

Data Center Networks
- Every server is wired to a ToR (top of rack) switch
- ToR switches in neighboring aisles are wired to an aggregation switch
- Aggregation switches are wired to core switches

Early data center networks
- 3 layers of switches:
  - Edge (ToR)
  - Aggregation
  - Core
- Upper layers use optical links, lower layers electrical

Early data center limitations
- Cost: core and aggregation routers = high capacity, low volume. Expensive!
- Fault-tolerance: failure of a single core or aggregation router = large bandwidth loss.
- Bisection bandwidth limited by the capacity of the largest available router, and Google's DC traffic doubles every year!

Clos networks
- How can I replace a big switch by many small switches?

Clos Networks
- What about bigger switches?
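
One standard way to make "many small switches instead of one big one" concrete is the k-ary fat tree, a folded Clos built entirely from k-port switches; the sizing sketch below is illustrative of that construction, not of any particular vendor's network. With k = 48, it yields 27,648 hosts from 2,880 identical switches.

    package rpcdemo

    // FatTreeHosts returns how many hosts a k-ary fat tree (a folded Clos
    // built only from k-port switches) supports at full bisection
    // bandwidth: k pods, each with k/2 edge switches serving k/2 hosts,
    // i.e. k * (k/2) * (k/2) = k^3/4 hosts.
    func FatTreeHosts(k int) int {
        return k * k * k / 4
    }

    // FatTreeSwitches returns the total switch count: (k/2)^2 core switches
    // plus k pods of k switches each (k/2 edge + k/2 aggregation).
    func FatTreeSwitches(k int) int {
        return k*k/4 + k*k
    }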

Multi-rooted tree
- Every pair of nodes has many paths
- Fault tolerant!
- But how do we pick a path?

Multipath routing
- Lots of bandwidth, split across many paths
- ECMP: hash on the packet header to determine the route
  - 5-tuple: source IP, source port, destination IP, destination port, protocol
  - Packets from client to server usually take the same route
- On a switch or link failure, ECMP sends subsequent packets along a different route => out-of-order packets!
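
A sketch of the per-flow hashing ECMP performs, with assumed Go types: hash the 5-tuple and reduce it modulo the number of live equal-cost paths, so one flow sticks to one path, and a change in the path count can move flows (hence the reordering above).

    package rpcdemo

    import (
        "fmt"
        "hash/fnv"
    )

    // FiveTuple identifies a flow; ECMP hashes it so that all packets of the
    // flow take the same equal-cost path.
    type FiveTuple struct {
        SrcIP, DstIP     string
        SrcPort, DstPort uint16
        Proto            uint8
    }

    // PickPath maps a flow onto one of nPaths equal-cost next hops (nPaths > 0).
    // If nPaths changes because a switch or link fails, flows can land on a
    // different path, which is why packets may arrive out of order.
    func PickPath(t FiveTuple, nPaths int) int {
        h := fnv.New32a()
        fmt.Fprintf(h, "%s|%s|%d|%d|%d", t.SrcIP, t.DstIP, t.SrcPort, t.DstPort, t.Proto)
        return int(h.Sum32() % uint32(nPaths))
    }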

Data Center Network Trends
- Round-trip latency across the data center ~ 10 usec
- 40 Gbps links common, 100 Gbps on the way: a 1KB packet every 80ns on a 100 Gbps link
- Direct delivery into the on-chip cache (DDIO)
- Upper levels of the tree are (expensive) optical links; thin tree to reduce costs
- Latency and bandwidth: within rack > within aisle > within DC > cross DC, so keep communication local

Local Storage
- Magnetic disks for long-term storage: high latency (10ms), low bandwidth (250 MB/s); compressed and replicated for cost and resilience
- Solid state storage for persistence and as a cache layer: 50us block access, multi-GB/s bandwidth
- Emerging NVM: low-energy DRAM replacement, sub-microsecond persistence

Day in the Life of a Web Request
- User types google.com; what happens?
- DNS = Domain Name System: a global database of name -> IP address
- DNS query goes to a root name server
  - Root name server is replicated (600+ instances) behind hard-coded IP anycast addresses
  - Updates happen asynchronously in the background (rare)
- Root returns: the IP address of Google's name server
  - Cached on the client (for a configurable period), so it can be out of date
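
From the application's point of view the whole chain above is a single library call; for example, Go's resolver does the caching and the walk to the authoritative server beneath net.LookupIP:

    package main

    import (
        "fmt"
        "log"
        "net"
    )

    // Resolve google.com the way a browser's first step does: ask DNS for
    // the A/AAAA records. The root/authoritative-server chain and caching
    // happen below this call, inside the resolver.
    func main() {
        addrs, err := net.LookupIP("google.com")
        if err != nil {
            log.Fatal(err)
        }
        for _, a := range addrs {
            fmt.Println(a) // one of the front-end addresses described on the next slides
        }
    }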

Name Server to DC
- DNS returns Google's name server
- Google's name server returns a local name server: a name server in a DC close to the user's IP address
- The local name server returns the IP of a web front end, also in the local data center
  - With a short timeout, so the client reroutes if the front end fails
- Browser sends an HTTP request to the front end

Resilient Scalable Web Front End
- The IP address of the front end is really an array of machines
- Packets first go through a load balancer (actually, an array of load balancers, all the same)
- The load balancer hashes the 3- or 5-tuple -> front end
  - Source IP, port #, destination IP, port #, protocol
  - This way, all traffic from a user goes to the same server
- When a front end fails, the load balancer redirects incoming packets elsewhere

Challenge
- How do we build a hash function that's stable? One that reroutes only the packets for the failed node, leaving the packets to other nodes alone.
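
One standard answer, not spelled out on the slide, is rendezvous (highest-random-weight) hashing: score every live node against the flow and pick the top scorer, so removing a failed node moves only the flows that node owned. A sketch with assumed names:

    package rpcdemo

    import (
        "fmt"
        "hash/fnv"
    )

    // rendezvousPick implements highest-random-weight ("rendezvous") hashing:
    // the key goes to the live node with the largest hash(key, node) score.
    // When a failed node is dropped from nodes, only the keys that scored
    // highest on that node move; every other key keeps its old assignment.
    func rendezvousPick(key string, nodes []string) string {
        best, bestScore := "", uint32(0)
        for _, n := range nodes {
            h := fnv.New32a()
            fmt.Fprintf(h, "%s|%s", key, n)
            if s := h.Sum32(); best == "" || s > bestScore {
                best, bestScore = n, s
            }
        }
        return best
    }

Consistent hashing on a ring has the same stability property; either approach works for the load balancer here and for the cache lookup on the next slide.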

Web Front End -> Cache, Storage Layer
- Key-value store, sharded across many cache and storage nodes: hash(key) -> cache, storage node
- If a cache node fails, rehash to a new cache node
- If a storage node fails, ???