CS 514: Transport Protocols for Datacenters

Department of Computer Science, Cornell University

Outline: 1. Motivation 2. Lateral Error Correction 3. Open Problem: LambdaNets

Commodity Datacenters: blade servers, fast interconnects. Different apps: Google -> search, Amazon -> e-tailing, computational finance, aerospace, military C&C, e-science... YouTube?

The Datacenter Paradigm: extreme scale-out. More nodes, more capacity; services are distributed / replicated / partitioned over multiple nodes. (Diagram: clients connecting to servers inside a datacenter.)

Building Datacenter Apps. High-level abstractions: publish/subscribe, event notification, replication (of data/functionality), caching. BEA WebLogic, JBoss, IBM WebSphere, TIBCO, RTI DDS, Tangosol, GemFire... What's under the hood? Multicast!

Properties of a Multicast Primitive. Rapid delivery... when failures occur (reliable)... at extreme scales (scalable). Questions: What technology do current systems use? Is it truly reliable and scalable? Can we do better?

A Brief History of Multicast. IP Multicast: Deering et al., 1988; limited deployment: the Mbone. Two divergent directions: overlay multicast instead of IP Multicast (e.g., BitTorrent) and reliable multicast over IP Multicast (e.g., TIBCO). Datacenters have IP Multicast support...

Multicast Research. Many different reliable, scalable protocols, designed for streaming video/TV and file distribution. Reliable: packet loss at WAN routers. Scalable: a single group with massive numbers of receivers. Not suited for datacenter multicast! Different failure mode, different scalability dimensions.

(Reliable) Multicast in the Datacenter. Packet loss occurs at end-hosts: independent and bursty. (Figure: burst length in packets vs. time in seconds, showing loss bursts and data bursts at receiver r1 in groups A and B and at receiver r2 in group A.)

(Scalable) Multicast in the Datacenter. Financial datacenter example: each equity is mapped to a multicast group, and each node is interested in a different set of equities... so each node joins a different set of groups (one node tracking a portfolio, another tracking the S&P 500). Lots of overlapping groups = low per-group data rate.
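
As a quick illustrative calculation (numbers assumed, not from the slides): a node receiving 1,000 packets/second spread evenly over 100 overlapping groups sees only about 10 packets/second in any one group, so any recovery scheme whose speed depends on per-group (or per-sender, per-group) traffic has very little data to work with.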

Designing a Time-Critical Multicast Primitive. Wanted: a reliable, scalable multicast protocol. Reliable: can tolerate end-host loss bursts. Scalable in: the size of the group, the number of senders to a group, and the number of groups per node.

Outline: 1. Motivation 2. Lateral Error Correction 3. Open Problem: LambdaNets

Design Space for Reliable Multicast. How does latency scale? Two phases: discovery and recovery of lost packets. ACK/timeout: RMTP/RMTP-II. Gossip-based: Bimodal Multicast, lpbcast. NAK/sender-based sequencing: SRM. Forward Error Correction. Fundamental insight: latency ∝ 1 / data rate.

NAK/Sender-Based Sequencing: SRM (Scalable Reliable Multicast, developed 1998). Loss is discovered on the next packet from the same sender in the same group, so latency ∝ 1 / data rate, where the data rate is that of a single sender in a single group. (Figure: SRM average discovery delay in milliseconds vs. groups per node, 1 to 128, log scale.)

Forward Error Correction. Pros: tunable, proactive overhead; time-critical, since it eliminates the need for retransmission. Cons: FEC packets are generated over a stream of data, so the encoder has to wait for r data packets before generating FEC. Again latency ∝ 1 / data rate, where the data rate is that of a single sender in a single group.
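
To see why the proportionality matters (illustrative numbers, not from the slides): if a sender emits 10 packets/second into a group and the encoder needs r = 8 data packets before it can produce an FEC packet, roughly 0.8 seconds pass before a repair covering a lost packet can even be generated; at 1 packet/second the wait grows to about 8 seconds. The lower the data rate seen by the encoder (or the sequencer, in NAK schemes), the longer discovery and recovery take.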

Receiver-Based Forward Error Correction. Randomness: each receiver picks another receiver at random to send an XOR to. Tunability: the proportion of XOR packets to data packets is determined by the rate-of-fire (r, c). latency ∝ 1 / data rate, where the data rate is across all senders in a single group. (Diagram: data packets, a lost data packet, and XOR repair packets exchanged between receivers.)

Receiver-Based Forward Error Correction, continued. (Diagram: kernel and application buffers at two receivers; one receiver XORs the packets in its buffers and sends the XOR to the other, which uses it to recover a packet missing from its own buffers.)
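
To make the mechanism concrete, here is a minimal sketch of a receiver-side repair bin, assuming a hypothetical RepairBin/Receiver API and fixed-size packets (an illustration, not the actual implementation): every r incoming data packets are XORed together and c copies of the XOR are sent to randomly chosen group members.

    // Illustrative receiver-based FEC sketch with rate-of-fire (r, c).
    // Assumes fixed-size packets; RepairBin and Receiver are hypothetical names.
    import java.util.List;
    import java.util.Random;

    class RepairBin {
        private final int r;                 // data packets folded into one repair
        private final int c;                 // repair copies sent per r data packets
        private final Random rng = new Random();
        private byte[] xor;                  // running XOR of buffered packets
        private int count = 0;

        RepairBin(int r, int c) { this.r = r; this.c = c; }

        // Called for every data packet this receiver gets in the group.
        void onDataPacket(byte[] payload, List<Receiver> groupMembers, Receiver self) {
            if (xor == null) xor = new byte[payload.length];
            for (int i = 0; i < payload.length; i++) xor[i] ^= payload[i];
            if (++count == r) {
                for (int k = 0; k < c; k++) {
                    Receiver target = groupMembers.get(rng.nextInt(groupMembers.size()));
                    if (target != self) target.deliverRepair(xor.clone());  // skip if we picked ourselves
                }
                xor = null;
                count = 0;
            }
        }
    }

    interface Receiver { void deliverRepair(byte[] xorPayload); }

A receiver that lost exactly one of the r packets can XOR the repair with the r - 1 packets it did receive to reconstruct the missing one.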

Lateral Error Correction: Principle. Nodes n1 and n2 are both in groups A and B. Incoming data packets: A1 A2 A3 B1 B2 A4 B3 B4 A5 A6 B5. With per-group repair, the repairs are Repair Packet I: (A1, A2, A3, A4, A5) and Repair Packet II: (B1, B2, B3, B4, B5). With lateral error correction, each repair spans both groups: Repair Packet I: (A1, A2, A3, B1, B2) and Repair Packet II: (A4, A5, B3, B4, A6).

Lateral Error Correction combines error traffic for multiple groups within their intersections, while preserving: coherent, tunable per-group overhead (the ratio of data packets to repair packets in the system is r : c) and fairness (each node receives, on average, the same ratio of repair packets to data packets). latency ∝ 1 / data rate, where the data rate is across all senders in intersections of groups.

Lateral Error Correction: Mechanism. Divide overlapping groups into regions: n1 belongs to groups A, B, and C, and divides them into regions abc, ab, ac, bc, a, b, c. (Diagram: Venn diagram of groups A, B, C and their regions.)

Lateral Error Correction: Mechanism, continued. n1 selects proportionally sized chunks of c_A from the regions of A; the total number of targets selected across regions equals the c value of group A.
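
A small sketch of that proportional split (hypothetical names; the example region sizes echo the repair-bin diagram that follows): group A's c value is divided among A's regions in proportion to their sizes, so the shares sum back to c_A and the per-group overhead stays r : c.

    // Illustrative split of a group's c value across its regions (not the real code).
    import java.util.LinkedHashMap;
    import java.util.Map;

    class RegionPlanner {
        // regionSizes: nodes per region of group A, e.g. a=5, ab=2, ac=3, abc=10.
        // cA: the rate-of-fire c value configured for group A.
        static Map<String, Double> splitTargets(Map<String, Integer> regionSizes, double cA) {
            int total = regionSizes.values().stream().mapToInt(Integer::intValue).sum();
            Map<String, Double> shares = new LinkedHashMap<>();
            for (Map.Entry<String, Integer> e : regionSizes.entrySet())
                shares.put(e.getKey(), cA * e.getValue() / total);   // proportional share
            return shares;                                           // shares sum to cA
        }

        public static void main(String[] args) {
            Map<String, Integer> sizes = new LinkedHashMap<>();
            sizes.put("a", 5); sizes.put("ab", 2); sizes.put("ac", 3); sizes.put("abc", 10);
            System.out.println(splitTargets(sizes, 5.0));  // {a=1.25, ab=0.5, ac=0.75, abc=2.5}
        }
    }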

Repair Bin Structure. Repair bins: input: data packets in the union of the region's groups; output: repair packets sent to the region; expectation: the average number of targets chosen from the region. (Diagram: regions of overlap with example sizes a=5, ab=2, b=1, abc=10, ac=3, bc=7, c=5.)
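
Putting the pieces together (again only an illustration, reusing the hypothetical RepairBin and Receiver types from the earlier sketch): a node can route each incoming data packet to the repair bin of every region whose groups include the packet's group, so repairs built in a bin are useful to every node in that region.

    // Illustrative wiring of per-region repair bins (not the actual implementation).
    // Reuses the hypothetical RepairBin and Receiver types from the earlier sketch.
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    class LecDispatcher {
        // Each region is a subset of this node's groups, e.g. {"A","B","C"} for abc.
        private final Map<Set<String>, RepairBin> binsByRegion = new HashMap<>();
        private final Map<Set<String>, List<Receiver>> membersByRegion;

        LecDispatcher(Map<Set<String>, List<Receiver>> membersByRegion,
                      Map<Set<String>, Integer> cByRegion, int r) {
            this.membersByRegion = membersByRegion;
            // cByRegion: per-region target count, e.g. the rounded proportional share above.
            for (Set<String> region : membersByRegion.keySet())
                binsByRegion.put(region, new RepairBin(r, cByRegion.get(region)));
        }

        // A packet in group g feeds every region that contains g; repairs from that
        // bin go only to nodes in the region, all of which are also members of g.
        void onDataPacket(String group, byte[] payload, Receiver self) {
            for (Map.Entry<Set<String>, RepairBin> e : binsByRegion.entrySet())
                if (e.getKey().contains(group))
                    e.getValue().onDataPacket(payload, membersByRegion.get(e.getKey()), self);
        }
    }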

Experimental Evaluation. Cornell cluster: 64 1.3 GHz nodes; Java implementation running on Linux 2.6.12. Three loss models: {Uniform, Burst, Markov}. Grouping parameters: g × s = d × n, where g = number of groups in the system, s = average size of a group, d = groups joined by each node, n = number of nodes in the system. Each node joins d randomly selected groups out of the g groups.
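
The grouping identity is just membership counting. The following throwaway check, with example values assumed purely for illustration, joins each of n nodes to d random groups and confirms that the average group size comes out to d × n / g:

    // Sanity check of g * s = d * n (illustrative values, not the evaluation setup).
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Random;
    import java.util.Set;

    class GroupingCheck {
        public static void main(String[] args) {
            int n = 64, g = 128, d = 20;                 // assumed example values
            int[] size = new int[g];
            Random rng = new Random(1);
            for (int node = 0; node < n; node++) {
                Set<Integer> joined = new HashSet<>();
                while (joined.size() < d) joined.add(rng.nextInt(g));  // d distinct random groups
                for (int grp : joined) size[grp]++;
            }
            double s = Arrays.stream(size).average().orElse(0);
            // Equal by construction: n*d memberships spread over g groups.
            System.out.printf("average group size s = %.2f, d*n/g = %.2f%n", s, (double) d * n / g);
        }
    }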

Distribution of Recovery Latency. 16 nodes, 128 groups per node, 10 nodes per group, uniform loss. (Figure: recovery percentage vs. recovery latency in milliseconds: (a) 10% loss rate, 96.8% LEC + 3.2% NAK; (b) 15% loss rate, 92% LEC + 8% NAK; (c) 20% loss rate, 84% LEC + 16% NAK.)

Scalability in Groups. 64 nodes, varying groups per node, 10 nodes per group, loss model: Uniform 1%. (Figure: LEC recovery percentage and average recovery latency in microseconds vs. groups per node, 2 to 1024, log scale.) LEC scales to hundreds of groups. Comparison: at 128 groups, SRM's discovery latency was 8 seconds: 400 times slower!

CPU Time and XORs per Data Packet. 64 nodes, varying groups per node, 10 nodes per group, loss model: Uniform 1%. (Figure: per-packet CPU time in microseconds, broken down into updating repair bins, data packet accounting, waiting in buffer, and XOR, plus the number of XORs per data packet, both vs. groups per node, 2 to 1024, log scale.) LEC is lightweight, so time-critical apps can run over it.

Impact of Loss Rate on LEC. 64 nodes, 128 groups per node, 10 nodes per group, varying loss model. (Figure: LEC recovery percentage and recovery latency in microseconds vs. loss rate, 0 to 0.25, for Bursty (b=10), Uniform, and Markov (m=10) loss.) LEC works well at typical datacenter loss rates of 1-5%.

Resilience to Burstiness. 64 nodes, 128 groups per node, 10 nodes per group, loss model: Bursty. (Figure: recovery percentage and recovery latency in microseconds vs. burst size, 5 to 40 packets, for loss rates L = 0.1%, 1%, and 10%.) LEC handles short bursts (5-10 packets) well. Good enough?

Staggering. 64 nodes, 128 groups per node, 10 nodes per group, loss model: Bursty 1%. (Figure: LEC recovery percentage and recovery latency in microseconds vs. stagger, 1 to 10, for b=10, burst=50, burst=100, and Uniform loss.) A stagger of i means encoding every i-th packet. With a stagger of 6 and bursts of 100 packets, 90% of losses are recovered within 50 ms!
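
A stagger of i can be sketched as i parallel repair bins fed round-robin, so consecutive packets, and therefore consecutive losses in a burst, end up covered by different repairs. This reuses the hypothetical RepairBin and Receiver types from the earlier sketch and is illustrative only:

    // Illustrative staggering: interleave packets across `stagger` repair bins.
    class StaggeredEncoder {
        private final RepairBin[] bins;
        private int next = 0;

        StaggeredEncoder(int stagger, int r, int c) {
            bins = new RepairBin[stagger];
            for (int k = 0; k < stagger; k++) bins[k] = new RepairBin(r, c);
        }

        // Packet 0 -> bin 0, packet 1 -> bin 1, ..., then back to bin 0,
        // so each repair covers only every stagger-th packet.
        void onDataPacket(byte[] payload, java.util.List<Receiver> members, Receiver self) {
            bins[next].onDataPacket(payload, members, self);
            next = (next + 1) % bins.length;
        }
    }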

LEC: Overview. Time-critical datacenters have large numbers of low-rate groups and bursty end-host loss patterns. LEC is the first protocol to scale in the number of groups in the system.

Outline: 1. Motivation 2. Lateral Error Correction 3. Open Problem: LambdaNets

Open Problem: LambdaNets. The Lambda Internet: a collection of geographically dispersed datacenters... connected by optical lambda links. Applications need to run over LambdaNets: financial services operating in different markets, MNCs with operations in different countries, high-volume e-tailers.

Why is this hard? Speed of light! Existing systems are not designed for very high communication latencies: try executing a Java RMI call on a server sitting in India... or mirroring your Oracle database to Kentucky... A fundamental redesign of the software stack is needed.

Data Transport for the Lambda Internet. TCP/IP uses RTT-based timeouts and retransmissions... hundreds of milliseconds to recover lost packets! FEC is the perfect technology for long-distance transfer... but useless if loss is bursty. The proposed direction: decorrelated FEC, which constructs repair packets from across the multiple outgoing channels running from one datacenter to another.
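
The slide only states the idea at a high level; one plausible reading (an assumption of this sketch, not a description of an actual system) is an encoder that XORs the latest packet from each of several parallel outgoing channels, so a loss burst confined to one channel is unlikely to wipe out both a packet and every packet its repair depends on:

    // Hedged sketch of decorrelated FEC across parallel channels (illustrative only).
    class DecorrelatedFecEncoder {
        private final byte[][] latest;   // latest outgoing packet on each channel

        DecorrelatedFecEncoder(int channels) { latest = new byte[channels][]; }

        void onOutgoing(int channel, byte[] payload) { latest[channel] = payload; }

        // Build a repair mixing one packet from every channel (assumes fixed-size packets).
        byte[] buildRepair(int length) {
            byte[] xor = new byte[length];
            for (byte[] p : latest) {
                if (p == null) continue;
                for (int i = 0; i < length; i++) xor[i] ^= p[i];
            }
            return xor;
        }
    }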

Datacenters are the present (and future). The applications you build will run on datacenters. Current technology works... barely. Next-generation applications will push the limits of scalability: What if all TV is IP-based (YouTube on steroids)? What if all your data/functionality is remote (AJAX-based apps)? What if everything is remote (web-based operating...)?