Micro load balancing in data centers with DRILL Soudeh Ghorbani (UIUC) Brighten Godfrey (UIUC) Yashar Ganjali (University of Toronto) Amin Firoozshahian (Intel)
Where should the load balancing functionality live?
Why load balancing in data centers?
Data center apps have demanding network requirements.
Data center topologies provide high capacity.
[Figure: three-tier fat-tree topology with core, aggregation, and edge switches across four pods]
Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network, Google, SIGCOMM 2015.
VL2: A Scalable and Flexible Data Center Network, A. Greenberg et al., SIGCOMM 2009.
A Scalable, Commodity Data Center Network Architecture, M. Al-Fares et al., SIGCOMM 2008.
Jellyfish: Networking Data Centers Randomly, A. Singla et al., NSDI 2012.
But we are still not using the capacity efficiently!
Networks experience high congestion drops as utilization approaches 25% [1].
Further improving fabric congestion response remains an ongoing effort [1].
[1] Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network, Google, SIGCOMM 2015.
The gap: high bandwidth is provided via massive multipathing, but balancing load among many paths in real time seems too hard for our fast and dumb data center fabric.
Congestion happens even when there is spare capacity to mitigate it elsewhere [2].
[2] Network Traffic Characteristics of Data Centers in the Wild, T. Benson et al., IMC 2010.
ECMP is not the answer.
Selects among equal-cost paths by hashing the packet's 5-tuple.
Problems: coarse grained; stateless and local.
[Figure: example topology with switches A, B, C, and D]
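For concreteness, here is a minimal sketch of the ECMP selection described above: hash the flow's 5-tuple and use the result to index the list of equal-cost next hops. The function and field names, and the use of Python's built-in tuple hash, are illustrative assumptions; real switches compute a hardware hash (e.g., a CRC) over the same header fields.

```python
# Illustrative ECMP next-hop selection (not a specific switch's implementation).

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    """Pick one equal-cost next hop by hashing the 5-tuple.

    Every packet of a flow carries the same 5-tuple, so the whole flow sticks
    to a single path (coarse grained), and the choice ignores queue occupancy
    here and elsewhere (stateless and local).
    """
    flow_key = (src_ip, dst_ip, src_port, dst_port, proto)
    return next_hops[hash(flow_key) % len(next_hops)]

# Two large flows can hash onto the same uplink while other uplinks sit idle.
uplinks = [4, 5, 6]
print(ecmp_next_hop("10.0.1.2", "10.2.0.3", 43512, 80, "tcp", uplinks))
```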
Rethinking the problem: Hedera [NSDI'10], Mahout [INFOCOM'11], Fastpass [SIGCOMM'14], Planck [SIGCOMM'14], Presto [SIGCOMM'15], MPTCP [NSDI'11], CONGA [SIGCOMM'14]
Rethinking the problem
Let's move the load balancing functionality out of the core!
Moving LB from the fabric to a central controller: Hedera [NSDI'10], Mahout [INFOCOM'11], Fastpass [SIGCOMM'14], Planck [SIGCOMM'14]
Moving LB from the fabric to the hosts: MPTCP [NSDI'11]
Moving LB from the fabric to the edge: Presto [SIGCOMM'15], CONGA [SIGCOMM'14]
Scalable LB design space
[Chart: LB granularity (flow, elephant flow, flowcell/flowlet, packet, micro bursts) vs. control loop latency (oblivious, O(1s), O(100ms), O(10ms), O(1ms), O(µs)), with ECMP, Hedera, Planck, CONGA, and Presto plotted]
Micro load balancing
Microsecond timescales; microscopic, i.e., local to each switch port.
[Chart: the same design space, highlighting the unoccupied packet-granularity, O(µs) corner]
Micro LB: a plausible architecture
Symmetric topologies.
1. Discover the topology.
2. Compute multiple shortest paths.
3. Install them into the FIB:
dst | ports
A   | 4, 5, 6
B   | 4, 5, 6
C   | 4, 5, 6
~ ECMP, so far...
Inside a switch:
dst | ports
A   | 4, 5, 6, 7
B   | 4, 5, 6, 7
C   | 4, 5, 6
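A minimal sketch of what installing such a FIB could look like in software, under the assumption that the FIB simply maps each destination to its set of equal-cost output ports; the names and data structure are illustrative, not the talk's implementation.

```python
# Illustrative FIB: each destination maps to all output ports that lie on
# a shortest path toward it (computed from the discovered topology).
fib = {
    "A": [4, 5, 6, 7],
    "B": [4, 5, 6, 7],
    "C": [4, 5, 6],
}

def candidate_ports(dst):
    """Equal-cost output ports for a destination.

    Up to this point the state is exactly what ECMP keeps; schemes differ
    only in how one port is picked from this set for each flow or packet.
    """
    return fib[dst]
```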
The power of 2 choices
Random placement: n bins and n balls; each ball chooses a bin independently and uniformly at random.
Max load: Θ(log n / log log n), far from optimal.
Y. Azar, A. Z. Broder, A. R. Karlin, and E. Upfal. Balanced Allocations. SIAM Journal on Computing, 1994.
The power of 2 choices
n bins and n balls; balls placed sequentially, each into the least loaded of d ≥ 2 bins chosen uniformly at random.
Max load: Θ(log log n / log d).
[Plot: max load for d = 1, d = 2, and the optimal allocation]
Y. Azar, A. Z. Broder, A. R. Karlin, and E. Upfal. Balanced Allocations. SIAM Journal on Computing, 1994.
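A quick simulation makes the gap concrete; the code and parameters below are illustrative, not from the talk.

```python
# Balls and bins with the power of d choices.
import random

def max_load(n, d):
    """Throw n balls into n bins; each ball lands in the least loaded of
    d bins sampled uniformly at random.  Return the maximum bin load."""
    bins = [0] * n
    for _ in range(n):
        choices = [random.randrange(n) for _ in range(d)]
        best = min(choices, key=lambda b: bins[b])
        bins[best] += 1
    return max(bins)

n = 100_000
print("d = 1:", max_load(n, 1))   # grows like log n / log log n (typically 7-9 here)
print("d = 2:", max_load(n, 2))   # grows like log log n / log 2 (typically 3-4 here)
```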
What we want: queues instead of bins, with each ball (packet) still choosing independently and with no coordination.
Unlike the classical setting, packets arrive and drain over time, decisions are made concurrently, and a choice that lands on a full queue is dropped.
[Figure: packets arriving at parallel queues; one packet is dropped from a full queue]
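The sketch below illustrates the risk in this setting, under assumed toy parameters: with a single uncoordinated random choice per packet (d = 1), finite queues overflow even though the aggregate drain rate matches the aggregate arrival rate, while sampling two queues keeps lengths balanced.

```python
# Toy queue model: per-packet random choice over finite output queues.
import random

def run(d, queues=4, capacity=8, rounds=2000):
    q = [0] * queues
    drops = 0
    for _ in range(rounds):
        # One packet arrives per queue per round, each picking the least
        # loaded of d randomly sampled queues (d = 1 is a blind choice).
        for _ in range(queues):
            choices = [random.randrange(queues) for _ in range(d)]
            target = min(choices, key=lambda i: q[i])
            if q[target] < capacity:
                q[target] += 1
            else:
                drops += 1          # dropped despite spare room in other queues
        # Each queue drains one packet per round.
        for i in range(queues):
            q[i] = max(0, q[i] - 1)
    return drops

print("d = 1 drops:", run(1))       # typically nonzero
print("d = 2 drops:", run(2))       # typically zero or near zero
```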
Simulation methodology
OMNeT++, INET framework
Linux 2.6 TCP
Leaf/spine topologies
Data center traces from DevoFlow [SIGCOMM'11]
The pitfalls of choice
[Plot: standard deviation of queue lengths for d = 1, d = 2, and d = 40, compared against balls and bins]
Setting the parameters matters: with too many samples, uncoordinated sources herd onto the momentarily shortest queue and stability suffers.
DRILL: Decentralized Randomized In-network Local Load balancing
[Chart: the LB design space again, with DRILL at packet granularity and O(µs) control loop latency]
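A sketch of what a DRILL-style per-packet decision could look like, assuming each switch samples d random candidate output queues and also considers the port it found least loaded on its previous decision (memory of size m); the class, names, and defaults below are illustrative simplifications, not the exact specification from the talk.

```python
# Illustrative DRILL-style engine: per-packet, purely local decisions.
import random

class DrillEngine:
    def __init__(self, d=2, m=1):
        self.d = d            # queues sampled at random per packet
        self.m = m            # remembered least-loaded ports (0 disables memory)
        self.memory = []      # ports found least loaded on recent decisions

    def pick_port(self, candidate_ports, queue_len):
        """Forward to the least loaded of d random samples plus the memory."""
        samples = random.sample(candidate_ports, min(self.d, len(candidate_ports)))
        candidates = set(samples) | {p for p in self.memory if p in candidate_ports}
        best = min(candidates, key=lambda p: queue_len[p])
        if self.m > 0:
            self.memory = [best]          # simplified: remember only the winner
        return best

# Example: destination A is reachable via ports 4-7; queue lengths are read
# locally at the switch, so no control loop leaves the chip.
engine = DrillEngine(d=2, m=1)
print(engine.pick_port([4, 5, 6, 7], {4: 3, 5: 0, 6: 7, 7: 1}))
```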
Substantial improvement over prior work.
[Plot: performance of an incast application]
Thou shalt not split flows!
Splitting flows → reordering → throughput degradation.
Splitting flows → latency variance → reordering → throughput degradation.
DRILL splits flows along many paths, yet it keeps queueing delay variance low and therefore causes little reordering!
[Bar charts: fraction of paths used, queue-length standard deviation, out-of-order packets, and buffering delay (µs) for DRILL, Presto, and per-packet Random]
Insight: queueing delay variance is so small that it doesn't matter what path the packet takes.
Ongoing and future work
Efficient handling of asymmetry: failures, irregular topologies with non-equal-cost paths
Converged Ethernet
Micro Load Balancing with DRILL: Conclusion
Microscopic, microsecond decisions yield the lowest-latency load balancing.
Splitting flows is splitting hairs.
Strong candidate for augmenting data center switching hardware.