Department of Computer Science Cornell University
Outline: 1. Motivation 2. … 3. …
Commodity Datacenters: blade servers, fast interconnects. Different apps: Google -> Search, Amazon -> E-tailing, computational finance, aerospace, military C&C, e-science... YouTube?
The Datacenter Paradigm: extreme scale-out. More nodes, more capacity. Services are distributed / replicated / partitioned over multiple nodes. [Figure: clients connecting to servers in a datacenter.]
Building Datacenter Apps. High-level abstractions: Publish/Subscribe, Event Notification, Replication (data/functionality), Caching. BEA WebLogic, JBoss, IBM WebSphere, TIBCO, RTI DDS, Tangosol, GemFire... What's under the hood? Multicast!
Properties of a Multicast Primitive Rapid Delivery... when failures occur (reliable)... at extreme scales (scalable) Questions: What technology do current systems use? Is it truly reliable and scalable? Can we do better?
A Brief History of Multicast. IP Multicast - Deering et al., 1988. Limited deployment: the MBone. Two divergent directions: Overlay Multicast instead of IP Multicast (e.g., BitTorrent); Reliable Multicast over IP Multicast (e.g., TIBCO). Datacenters have IP Multicast support...
Multicast Research Motivation. Many different reliable, scalable protocols, designed for streaming video/TV and file distribution. Reliable: packet loss at WAN routers. Scalable: a single group with massive numbers of receivers. Not suited for datacenter multicast! Different failure mode, different scalability dimensions.
(Reliable) Multicast in the Datacenter. Packet loss occurs at end-hosts, and is independent and bursty. [Figure: burst length (packets) over time (seconds) for receiver r1 (loss and data bursts in A and B) and receiver r2 (loss and data bursts in A).]
(Scalable) Multicast in the Datacenter. Financial datacenter example: each equity is mapped to a multicast group. Each node is interested in a different set of equities... each node joins a different set of groups (e.g., tracking a portfolio, tracking the S&P 500). Lots of overlapping groups = low per-group data rate.
Designing a Time-Critical Multicast Primitive. Wanted: a reliable, scalable multicast protocol. Reliable: can tolerate end-host loss bursts. Scalable in: the size of the group; the number of senders to a group; the number of groups per node.
Design Space for Reliable Multicast. How does latency scale? Two phases: discovery and recovery of lost packets. ACK/timeout: RMTP/RMTP-II. Gossip-based: Bimodal Multicast, lpbcast. NAK/sender-based sequencing: SRM. Forward Error Correction. Fundamental insight: latency ∝ 1 / data rate.
NAK/Sender-Based Sequencing: SRM. Scalable Reliable Multicast, developed 1998. Loss is discovered on the next packet from the same sender in the same group: latency ∝ 1 / data rate, where the data rate is at a single sender, in a single group. [Figure: SRM average discovery delay grows from near 0 to ~8000 ms as groups per node rise from 1 to 128 (log scale).]
Forward Error Correction. Pros: tunable, proactive overhead; time-critical (eliminates the need for retransmission). Cons: FEC packets are generated over a stream of data, so a sender has to wait for r data packets before generating FEC. latency ∝ 1 / data rate, where the data rate is at a single sender, in a single group.
Receiver-Based Forward Error Correction. Randomness: each receiver picks another receiver at random to send its XOR to. Tunability: the percentage of XOR packets to data packets is determined by the rate-of-fire (r, c). latency ∝ 1 / data rate, where the data rate is across all senders, in a single group. [Figure: receivers exchange XOR packets; a lost data packet is recovered from an XOR.]
[Figure: a receiver computes an XOR over packets B-F in its kernel and application buffers and sends it to another receiver, which recovers a missing packet from it.]
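A minimal sketch of the XOR mechanism above (illustrative code, not the actual implementation): with rate-of-fire (r, c), a receiver XORs every r incoming data packets and sends the result to c randomly chosen receivers; any single loss among those r packets can then be recovered.

```python
import random

def xor_packets(packets):
    """Byte-wise XOR of equal-length packets."""
    out = bytearray(len(packets[0]))
    for p in packets:
        for i, b in enumerate(p):
            out[i] ^= b
    return bytes(out)

def send_repairs(data_window, receivers, c, rng=random):
    """After r data packets arrive, XOR them and pick c random
    targets for the repair. Names here are illustrative."""
    repair = xor_packets(data_window)
    return repair, rng.sample(receivers, c)

def recover(repair, received):
    """XOR the repair with the r-1 packets that did arrive to
    reconstruct the single missing packet."""
    return xor_packets([repair] + received)
```

A single loss per repair window is the recoverable case; two losses in the same window fall back to other recovery (e.g., NAKs), which matches the LEC + NAK split in the evaluation.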
Lateral Error Correction: Principle. Nodes n1 and n2 are both in groups A and B. Incoming data packets: A1 A2 A3 B1 B2 A4 B3 B4 A5 A6 B5. Per-group repairs: Repair Packet I: (A1, A2, A3, A4, A5); Repair Packet II: (B1, B2, B3, B4, B5). LEC repairs over the combined stream: Repair Packet I: (A1, A2, A3, B1, B2); Repair Packet II: (A4, A5, B3, B4, A6).
Lateral Error Correction. Combine error traffic for multiple groups within intersections, while conserving: Coherent, tunable per-group overhead: the ratio of data packets to repair packets in the system is r : c. Fairness: each node receives, on average, the same ratio of repair packets to data packets. latency ∝ 1 / data rate, where the data rate is across all senders, in intersections of groups.
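The difference from per-group FEC can be sketched as follows (hypothetical helper, Python): per-group FEC waits for r packets in each group separately, while LEC fills repair sets from the combined incoming stream of an intersection, reproducing the repair packets in the example above.

```python
def lec_repair_sets(incoming, r=5):
    """Group the interleaved incoming stream of an intersection into
    repair sets of r packets, regardless of which group each packet
    belongs to."""
    return [incoming[i:i + r] for i in range(0, len(incoming) - r + 1, r)]

# The stream from the example: packets of groups A and B interleaved.
stream = ["A1", "A2", "A3", "B1", "B2", "A4", "B3", "B4", "A5", "A6", "B5"]
```

Because repair sets fill at the combined rate of all groups in the intersection, latency depends on the intersection's aggregate data rate rather than any single group's rate.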
Lateral Error Correction: Mechanism. Divide overlapping groups into regions. n1 belongs to groups A, B, C: it divides them into regions abc, ab, ac, bc, a, b, c. [Figure: Venn diagram of the three overlapping groups.]
Lateral Error Correction: Mechanism. n1 selects proportionally sized chunks of c_A from the regions of A (a, ab, ac, abc), so the total number of targets selected across regions equals the c value of group A.
Repair Bin Structure. Repair bins: Input: data packets in the union of a region's groups. Output: repair packets to the region. Expectation: the average number of targets chosen from each region. [Figure: repair bins for regions with sizes a=5, ab=2, b=1, abc=10, ac=3, bc=7, c=5.]
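The region computation behind the bins can be sketched like this (illustrative code, assuming each node knows which of its groups its peers belong to): a region is the set of nodes sharing an identical membership signature restricted to this node's groups.

```python
def regions(my_groups, membership):
    """Map each non-empty subset of my_groups to the peers whose
    membership, restricted to my_groups, is exactly that subset."""
    out = {}
    for node, groups in membership.items():
        key = frozenset(groups & my_groups)
        if key:
            out.setdefault(key, set()).add(node)
    return out
```

Each region then gets one repair bin, fed by data packets from the union of that region's groups.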
Experimental Evaluation. Cornell cluster: 64 1.3 GHz nodes; Java implementation running on Linux 2.6.12. Three loss models: {Uniform, Burst, Markov}. Grouping parameters: g·s = d·n, where g = number of groups in the system, s = average group size, d = groups joined by each node, n = number of nodes in the system. Each node joins d randomly selected groups from the g groups.
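The grouping identity is just conservation of memberships: n nodes each create d memberships, spread over g groups, so the mean group size is s = d·n/g. A quick check of that setup (hypothetical parameter values, not the paper's exact configurations):

```python
import random

def mean_group_size(n, g, d, seed=0):
    """Each of n nodes joins d of g groups uniformly at random;
    return the resulting average group size."""
    rng = random.Random(seed)
    sizes = [0] * g
    for _ in range(n):
        for grp in rng.sample(range(g), d):
            sizes[grp] += 1
    return sum(sizes) / g
```

Since the total number of memberships is always exactly d·n, the mean size equals d·n/g for every random seed; only the distribution around that mean varies.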
Distribution of Recovery Latency. 16 nodes, 128 groups per node, 10 nodes per group, Loss Model: Uniform. [Figure: distribution of recovery latency (0-250 ms): (a) 10% loss rate: 96.8% LEC + 3.2% NAK; (b) 15% loss rate: 92% LEC + 8% NAK; (c) 20% loss rate: 84% LEC + 16% NAK.]
Scalability in Groups. 64 nodes, varying groups per node, 10 nodes per group, Loss Model: Uniform 1%. [Figure: LEC recovery percentage stays above 90% and average recovery latency below ~50,000 µs from 2 to 1024 groups per node (log scale).] Scales to hundreds of groups. Comparison: at 128 groups, SRM latency was 8 seconds: 400 times slower!
CPU Time and XORs per Data Packet. 64 nodes, varying groups per node, 10 nodes per group, Loss Model: Uniform 1%. [Figure: CPU time per data packet (updating repair bins, data packet accounting, waiting in buffer, XOR) stays below ~450 µs, and XORs per data packet below ~5, from 2 to 1024 groups per node (log scale).] The protocol is lightweight = time-critical apps can run over it.
Impact of Loss Rate on LEC. 64 nodes, 128 groups per node, 10 nodes per group, varying loss model. [Figure: LEC recovery percentage and recovery latency vs. loss rate (0-25%) for Bursty (b=10), Uniform, and Markov (m=10) loss.] Works well at typical datacenter loss rates: 1-5%.
Resilience to Burstiness. 64 nodes, 128 groups per node, 10 nodes per group, Loss Model: Bursty 1%. [Figure: recovery percentage and recovery latency vs. burst size (5-40 packets) at loss rates L = 0.1%, 1%, 10%.] Can handle short bursts (5-10 packets) well. Good enough?
Staggering. 64 nodes, 128 groups per node, 10 nodes per group, Loss Model: Bursty 1%. Stagger of i: encode every ith packet. [Figure: recovery percentage and recovery latency vs. stagger (1-10) for burst sizes 10, 50, 100 and uniform loss.] Stagger of 6, burst of 100 packets = 90% recovered at 50 ms!
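A sketch of why staggering tolerates bursts (illustrative code): a stagger of i splits the stream into i interleaved lanes, each encoded separately, so a loss burst of length at most i removes at most one packet from any repair set and stays recoverable.

```python
def staggered_sets(packets, r, i):
    """Split the stream into i interleaved lanes (every ith packet)
    and build repair sets of size r within each lane."""
    sets = []
    for k in range(i):
        lane = packets[k::i]
        sets += [lane[j:j + r] for j in range(0, len(lane) - r + 1, r)]
    return sets
```

The trade-off is latency: a lane fills r-packet repair sets i times more slowly than the undivided stream, which is why larger staggers raise recovery latency in the plot.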
Overview. Time-critical datacenters: large numbers of low-rate groups, bursty end-host loss patterns. This is the first protocol to scale in the number of groups in the system.
Open Problem: LambdaNets. The Lambda Internet: a collection of geographically dispersed datacenters... connected by optical lambda links. Applications need to run over LambdaNets: financial services operating in different markets; MNCs with operations in different countries; high-volume e-tailers.
Why is this hard? Speed of light! Existing systems are not designed for very high communication latencies: try executing a Java RMI call on a server sitting in India... or mirroring your Oracle database to Kentucky... We need a fundamental redesign of the software stack.
Data Transport for the Lambda Internet. TCP/IP uses RTT-based timeouts and retransmissions... hundreds of milliseconds to recover lost packets! FEC is the perfect technology for long-distance transfer... but useless if loss is bursty. Decorrelated FEC: constructs repair packets from across multiple outgoing channels from one datacenter to another.
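One way to read "decorrelated" (an illustrative sketch under assumptions, not the actual design): build each repair set from one packet per outgoing channel, so a loss burst confined to a single channel costs every repair set at most one packet, which XOR-style repair can recover.

```python
def cross_channel_sets(channels):
    """Each repair set takes one packet from every outgoing channel,
    so bursty loss on one channel is spread across many sets."""
    return [list(group) for group in zip(*channels)]
```

The same per-set XOR from receiver-based FEC then applies: one missing packet per set is recoverable without an RTT-scale retransmission.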
Datacenters are the Present (and Future). The applications you build will run on datacenters. Current technology works... barely. Next-generation applications will push the limits of scalability: What if all TV is IP-based (YouTube on steroids)? What if all your data/functionality is remote (AJAX-based apps)? What if everything is remote (web-based operating systems)?