Utilizing Datacenter Networks: Centralized or Distributed Solutions?


Utilizing Datacenter Networks: Centralized or Distributed Solutions? Costin Raiciu Department of Computer Science University Politehnica of Bucharest

We've gotten used to great applications

Enabling Such Apps is Hard. Apps process huge amounts of data and must be fast and reliable. One machine is not enough (limited reliability and speed), and supercomputers are expensive, so we use many commodity machines instead.

Data Centers Rule the World. Cloud computing brings economies of scale: networks of tens of thousands of hosts. Datacenter apps support web search and online stores: web search, GFS, BigTable, DryadLINQ, MapReduce. Traffic patterns are dense, and intra-datacenter traffic is increasing in volume.

Flexibility is Important in Data Centers. Apps are distributed across thousands of machines; flexibility means we want any machine to be able to play any role.

Traditional Data Center Topology [figure: Internet, core switch, aggregation switches, top-of-rack switches, racks of servers; 10Gbps links toward the core, 1Gbps links toward the racks]

Fat Tree Topology [Al-Fares et al., 2008; Clos, 1953] [figure: K=4 fat tree with aggregation switches, K pods of K switches each, 1Gbps links, racks of servers]

VL2 Topology [Greenberg et al., 2009; a Clos topology] [figure: 10Gbps links in the core, 20 hosts per rack]

BCube Topology [Guo et al., 2009] [figure: BCube(4,1)]

How Do We Route Packets in Data Centers? Traditional routing uses the Spanning Tree Protocol, which kills all redundancy. Instead, datacenters use one of the following techniques: multiple VLANs, OSPF, or TRILL (a new IETF standard).

How Do We Use this Capacity? We need to distribute flows across the available paths. The basic solution is random load balancing: use Equal-Cost Multipath (ECMP) routing (OSPF, TRILL) and hash each flow to a path at random, or have sources randomly pick a VLAN. In practice, sources have multiple interfaces and pick a random source address for each flow.
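To make the hashing step concrete, here is a minimal Python sketch of ECMP-style random load balancing; the function and path names are illustrative, not taken from any real switch. The flow's 5-tuple is hashed once, so every packet of the flow sticks to the same path, which is exactly why two large flows can collide on a link.

import hashlib

def ecmp_pick_path(src_ip, dst_ip, src_port, dst_port, proto, paths):
    """Hash the flow 5-tuple and use it to pick one of the equal-cost paths.
    All packets of a flow hash to the same value, so the whole flow sticks
    to one path -- which is why two large flows can collide."""
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    digest = int.from_bytes(hashlib.sha1(key).digest()[:8], "big")
    return paths[digest % len(paths)]

# Example: two flows between different hosts may still hash onto the same core link.
paths = ["core-1", "core-2", "core-3", "core-4"]
print(ecmp_pick_path("10.0.1.2", "10.0.3.4", 45123, 80, 6, paths))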

Collisions [figure: racks of servers connected by 1Gbps links; flows hashed onto the same link collide]

Single-path TCP collisions reduce throughput

How bad are collisions? Capacity wasted (worst case): FatTree 60%, BCube 50%, VL2 25%.

How do we address this problem? I will discuss two solutions: flow scheduling and Multipath TCP.

Flow Scheduling: Hedera [Al-Fares et al., NSDI 2010]

Solving Collisions with Flow Scheduling [figure: a centralized controller manages OpenFlow switches connected by 1Gbps links] 1. Pull stats, detect large flows. 2. Compute flow demands. 3. Compute placement. 4. Place flows.

Hedera Main Idea: schedule elephant flows, since they carry most of the bytes; ECMP deals with the short flows.

Detecting Elephants: poll edge switches for byte counts; flows exceeding 100Mb/s are large. What if there are only short flows? ECMP should be good enough.
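A rough sketch of this detection step, assuming we periodically snapshot per-flow byte counters at the edge switches; the constants and names are illustrative, not Hedera's actual code.

ELEPHANT_THRESHOLD_BPS = 100_000_000  # 100 Mb/s, the threshold from the slide
POLL_INTERVAL_S = 5.0                 # polling period (illustrative)

def detect_elephants(prev_bytes, curr_bytes):
    """Compare two successive byte-counter snapshots from the edge switches
    and flag flows whose average rate exceeds the threshold."""
    elephants = []
    for flow, now in curr_bytes.items():
        before = prev_bytes.get(flow, 0)
        rate_bps = (now - before) * 8 / POLL_INTERVAL_S
        if rate_bps > ELEPHANT_THRESHOLD_BPS:
            elephants.append(flow)
    return elephants

# Flows below the threshold are left to ECMP; only elephants get scheduled.
prev = {("10.0.1.2", "10.0.3.4", 45123, 80): 0}
curr = {("10.0.1.2", "10.0.3.4", 45123, 80): 200_000_000}
print(detect_elephants(prev, curr))   # averages 320 Mb/s -> flagged as an elephant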

Demand Estimation. Current flow rates are a poor indicator of flow demand, because the network itself could be the bottleneck. Hedera's approach: estimate what a flow would get if the network were not a bottleneck.

Demand estimation: simple example [figure: two flows into a 1Gb/s link are each estimated at 500Mb/s]. General approach: an iterative algorithm.
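The sketch below is a heavily simplified version of that iterative idea, not Hedera's exact pseudocode: senders repeatedly share their capacity among their flows, and oversubscribed receivers cap and freeze the flows into them, until the estimates stop changing.

def estimate_demands(flows, cap=1.0):
    """Simplified, host-fair iterative demand estimation.
    flows: list of (src, dst) pairs; cap: host NIC capacity in Gb/s."""
    demand = {f: 0.0 for f in flows}
    converged = {f: False for f in flows}
    for _ in range(100):
        old = dict(demand)
        # Sender step: spread leftover capacity over non-converged flows.
        for src in {s for s, _ in flows}:
            out = [f for f in flows if f[0] == src]
            fixed = sum(demand[f] for f in out if converged[f])
            free = [f for f in out if not converged[f]]
            if free:
                share = max(cap - fixed, 0.0) / len(free)
                for f in free:
                    demand[f] = share
        # Receiver step: cap oversubscribed receivers and freeze their flows.
        for dst in {d for _, d in flows}:
            inc = [f for f in flows if f[1] == dst]
            if sum(demand[f] for f in inc) > cap:
                share = cap / len(inc)
                for f in inc:
                    demand[f] = share
                    converged[f] = True
        if demand == old:
            break
    return demand

# Slide example: two senders into one 1Gb/s receiver -> 0.5 Gb/s each.
print(estimate_demands([("A", "C"), ("B", "C")]))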

Allocating Flows to Paths is a Multi-Commodity Flow Problem. With single-path forwarding it is expressed as Binary Integer Programming and is NP-complete: solvers give an exact solution but are impractical for large networks.

Approximating Multi-Commodity Flow. Global First Fit: linearly search all paths until one that can accommodate the traffic is found; flows are placed upon detection and are not moved. Simulated Annealing: a probabilistic search for good solutions that maximize bisection bandwidth.
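A minimal sketch of the Global First Fit idea (data structures and names are illustrative): scan a flow's candidate paths in order and reserve the first one with enough spare capacity on every link.

def global_first_fit(flow_demand, flow_paths, link_capacity, reserved):
    """Place one flow: return the first candidate path (a list of links)
    whose every link still has flow_demand of spare capacity, and record
    the reservation in 'reserved' (link -> bandwidth already reserved)."""
    for path in flow_paths:
        if all(reserved.get(link, 0.0) + flow_demand <= link_capacity[link]
               for link in path):
            for link in path:
                reserved[link] = reserved.get(link, 0.0) + flow_demand
            return path          # flow stays pinned here until it ends
    return None                  # nothing fits; leave the flow to ECMP

# Two 0.6Gb/s elephants, two 1Gb/s core links: they end up on different links.
link_capacity = {"l1": 1.0, "l2": 1.0}
reserved = {}
print(global_first_fit(0.6, [["l1"], ["l2"]], link_capacity, reserved))  # ['l1']
print(global_first_fit(0.6, [["l1"], ["l2"]], link_capacity, reserved))  # ['l2']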

Fault Tolerance. Scheduler failure: all state is soft, so just fall back to ECMP. Link and switch failures: PortLand notifies the scheduler.

Does it work?

Hedera: One Flow, One Path, Centralized. Can it scale to really large datacenters? It needs a very tight control loop: how often does it need to run to achieve these benefits? It also makes a strong assumption: that traffic is always bottlenecked by the network. What about app-limited traffic, e.g. disk reads/writes?

Hedera: One Flow, One Path, Centralized. Can it scale to really large datacenters? MAYBE. It needs a very tight control loop: FIXABLE, but this is the wrong place to start. The strong assumption that traffic is always bottlenecked by the network breaks for app-limited traffic, e.g. disk reads/writes: ONLY HOSTS KNOW.

Multipath topologies need multipath transport. Multipath transport enables better topologies.

Collision

Not fair

No matter how you do it, mapping each flow to a path is the wrong goal

Instead, we should pool capacity from different links

Multipath Transport

Multipath Transport can pool datacenter networks. Instead of using one path for each flow, use many random paths. Don't worry about collisions; just don't send (much) traffic on colliding paths.

Multipath TCP Primer [IETF MPTCP WG]. MPTCP is a drop-in replacement for TCP: it works with unmodified applications over the existing network.

MPTCP Operation SYN MP_CAPABLE X

MPTCP Operation SYN/ACK MP_CAPABLE Y

MPTCP Operation STATE 1 CWND Snd.SEQNO Rcv.SEQNO

MPTCP Operation STATE 1 CWND Snd.SEQNO Rcv.SEQNO SYN JOIN Y

MPTCP Operation STATE 1 CWND Snd.SEQNO Rcv.SEQNO SYN/ACK JOIN X

MPTCP Operation STATE 1 CWND Snd.SEQNO Rcv.SEQNO STATE 2 CWND Snd.SEQNO Rcv.SEQNO

MPTCP Operation SEQ 1000 options DSEQ 10000 DATA STATE 1 CWND Snd.SEQNO Rcv.SEQNO STATE 2 CWND Snd.SEQNO Rcv.SEQNO

MPTCP Operation SEQ 1000 options DSEQ 10000 DATA STATE 1 CWND Snd.SEQNO Rcv.SEQNO SEQ 5000 options DSEQ 11000 DATA STATE 2 CWND Snd.SEQNO Rcv.SEQNO

MPTCP Operation ACK 2000 STATE 1 CWND Snd.SEQNO Rcv.SEQNO STATE 2 CWND Snd.SEQNO Rcv.SEQNO

MPTCP Operation SEQ 2000 options DSEQ 11000 DATA STATE 1 CWND Snd.SEQNO Rcv.SEQNO STATE 2 CWND Snd.SEQNO Rcv.SEQNO
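The frames above boil down to two sequence spaces: each subflow keeps its ordinary TCP sequence numbers (SEQ), while a connection-level data sequence number (DSEQ), carried in TCP options, lets the receiver reassemble the stream and lets data sent on one subflow be resent on another (the last frames appear to show DSEQ 11000 being carried on subflow 1 after originally being sent on subflow 2). A toy Python model of the sender's bookkeeping, purely illustrative:

class MptcpSender:
    """Toy model of the two sequence spaces in the slides: each subflow
    keeps its own SEQ, while a shared data-level DSEQ (carried in a TCP
    option) stitches the byte stream back together at the receiver."""

    def __init__(self, subflow_isns, data_isn):
        self.sub_seq = dict(subflow_isns)   # subflow id -> next subflow SEQ
        self.dseq = data_isn                # next data-level sequence number

    def send(self, subflow, payload_len):
        segment = {
            "subflow": subflow,
            "SEQ": self.sub_seq[subflow],   # per-subflow sequence number
            "DSEQ": self.dseq,              # connection-level sequence number
            "len": payload_len,
        }
        self.sub_seq[subflow] += payload_len
        self.dseq += payload_len
        return segment

# Reproducing the slides: 1000 bytes on subflow 1, then 1000 bytes on subflow 2.
s = MptcpSender({1: 1000, 2: 5000}, data_isn=10000)
print(s.send(1, 1000))   # SEQ 1000, DSEQ 10000 on subflow 1
print(s.send(2, 1000))   # SEQ 5000, DSEQ 11000 on subflow 2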

Multipath TCP: Congestion Control [NSDI, 2011]
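The coupled ("linked increases") controller from that work, stated here roughly in its later RFC 6356 per-packet form with details omitted, couples the windows w_r of the subflows so that traffic shifts away from congested paths while the connection is no more aggressive than a single TCP on its best path:

\begin{align*}
\text{on ACK on subflow } r:\quad
  & w_r \leftarrow w_r + \min\!\left(\frac{a}{\sum_i w_i},\; \frac{1}{w_r}\right) \\
\text{on loss on subflow } r:\quad
  & w_r \leftarrow \frac{w_r}{2} \\
\text{where}\quad
  & a = \left(\sum_i w_i\right)\cdot
        \frac{\max_i \left(w_i / \mathrm{RTT}_i^{2}\right)}
             {\left(\sum_i w_i / \mathrm{RTT}_i\right)^{2}}
\end{align*}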

MPTCP better utilizes the FatTree network

MPTCP on EC2. Amazon EC2 is infrastructure as a service: we can borrow virtual machines by the hour, they run in Amazon datacenters worldwide, and we can boot our own kernel. A few availability zones have multipath topologies: 2-8 paths are available between hosts not on the same machine or in the same rack, reachable via ECMP.

Amazon EC2 Experiment: 40 medium CPU instances running MPTCP. For 12 hours, we sequentially ran all-to-all iperf, cycling through TCP and MPTCP (2 and 4 subflows).

MPTCP improves performance on EC2

Where do MPTCP s benefits come from?

Allocating Flows to Paths is a Multi-Commodity Flow Problem. Single-path forwarding: expressed as Binary Integer Programming, NP-complete; solvers give an exact solution but are impractical for large networks. Multipath forwarding: expressed as a Linear Programming problem, solvable in polynomial time.
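For reference, a sketch of the path-based multipath formulation (an illustrative LP, not the exact one from any particular paper): x_{f,p} is the rate of flow f on path p, P_f its candidate paths, d_f its demand, and c_l the capacity of link l.

\begin{align*}
\max \quad & \sum_{f} \sum_{p \in P_f} x_{f,p} \\
\text{s.t.} \quad & \sum_{f} \sum_{p \in P_f : l \in p} x_{f,p} \le c_l
      && \forall \text{ links } l \\
 & \sum_{p \in P_f} x_{f,p} \le d_f && \forall \text{ flows } f \\
 & x_{f,p} \ge 0 && \forall f, p
\end{align*}

Restricting each flow to a single path replaces x_{f,p} >= 0 with the binary choice x_{f,p} in {0, d_f}, which is the NP-complete integer program mentioned above.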

How many subflows are needed? How does the topology affect results? How does the traffic matrix affect results?

At most 8 subflows are needed [figure: total throughput vs. number of MPTCP subflows, with single-path TCP as the baseline]

MPTCP improves fairness in VL2 topologies [figure: per-host throughput in VL2]. Fairness is important: jobs finish when the slowest worker finishes.

MPTCP improves throughput and fairness in BCube [figure: per-host throughput in BCube for MPTCP, single-path TCP, and the optimum]

Oversubscribed Topologies. To saturate the full bisection bandwidth, there must be no traffic locality, all hosts must send at the same time, and host links must not be bottlenecks. It therefore makes sense to under-provision the network core, and this is what happens in practice. Does MPTCP still provide benefits?

Performance improvements depend on the traffic matrix [figure: as load increases, the network goes from underloaded, through a sweet spot, to overloaded]

MPTCP vs. Centralized Scheduling

MPTCP vs. Hedera [figure: throughput of First Fit centralized scheduling at increasing scheduling intervals, up to an infinite scheduling interval, compared with MPTCP]

Centralized Scheduling: Setting the Threshold [figure: throughput vs. elephant detection threshold with app-limited flows; with thresholds between 100Mbps and 1Gbps, centralized scheduling is 17%-51% worse than Multipath TCP]

MPTCP vs. Hedera
Implementation: MPTCP is distributed; Hedera is centralized.
Network changes: none for MPTCP; Hedera requires upgrading all switches to OpenFlow.
Hardware needed: none for MPTCP; Hedera needs a centralized scheduler.
Software changes: MPTCP changes the host stack; Hedera needs none.
Scope: MPTCP schedules more flows; Hedera schedules large flows only.
Convergence: MPTCP is time-scale invariant (RTTs); Hedera's tight control loop limits scalability.
Fairness: MPTCP is fair; Hedera is less fair.

What is an optimal datacenter topology for multipath transport?

In single-homed topologies, host links are often bottlenecks, and ToR switch failures wipe out tens of hosts for days. Multi-homing servers is the obvious way forward.

Fat Tree Topology

Fat Tree Topology [figure: servers, ToR switch, upper pod switch]

Dual Homed Fat Tree Topology [figure: each server connects to two ToR switches; ToR switches, upper pod switch]

Is DHFT any better than Fat Tree? Not for traffic matrices that fully utilize the core. Let's examine random traffic patterns.

DHFT provides significant improvements when the core is not overloaded [figure: relative throughput in the core-underloaded and core-overloaded regimes]

Summary. One flow, one path thinking has constrained datacenter design, leading to collisions, unfairness, and limited utilization. Fixing these is possible, but does not address the bigger issue. Multipath transport enables resource pooling in datacenter networks: it improves throughput, fairness, and robustness. One flow, many paths frees designers to consider topologies that offer improved performance for similar cost.

Backup Slides

Effect of MPTCP on short flows. Flow sizes from the VL2 dataset; MPTCP enabled for long flows only (timer-based); oversubscribed Fat Tree topology. Results: TCP/ECMP completion time 79ms, core utilization 25%; MPTCP completion time 97ms, core utilization 65%.

Effect of Locality in the Dual Homed Fat Tree

Overloaded Fat Tree: better fairness with Multipath TCP

VL2 Topology [Greenberg et al., 2009; a Clos topology] [figure: 10Gbps links in the core, 20 hosts per rack]

BCube Topology [Guo et al., 2009] [figure: BCube(4,1)]