Micro load balancing in data centers with DRILL Soudeh Ghorbani (UIUC) Brighten Godfrey (UIUC) Yashar Ganjali (University of Toronto) Amin Firoozshahian (Intel)
Where should the load balancing functionality live?
Why load balancing in data centers?
Data center apps have demanding network requirements.
Data center topologies provide high capacity.
[Figure: three-tier fat-tree topology with core, aggregation, and edge switches across four pods]
Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network, Google, SIGCOMM 2015.
VL2: A Scalable and Flexible Data Center Network, A. Greenberg et al., SIGCOMM 2009.
A Scalable, Commodity Data Center Network Architecture, M. Al-Fares et al., SIGCOMM 2008.
Jellyfish: Networking Data Centers Randomly, A. Singla et al., NSDI 2012.
But we are still not using the capacity efficiently!
Networks experience high congestion drops as utilization approaches 25% [1].
Further improving fabric congestion response remains an ongoing effort [1].
[1] Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network, Google, SIGCOMM 2015.
The gap: high bandwidth is provided via massive multipathing, but balancing load among many paths in real time seems too hard for our fast and dumb data center fabric.
Congestion happens even when there is spare capacity to mitigate it elsewhere [2].
[2] Network Traffic Characteristics of Data Centers in the Wild, T. Benson et al., IMC 2010.
ECMP is not the answer.
Selects among equal-cost paths by hashing the packet's 5-tuple.
Problems: coarse grained; stateless and local.
[Figure: example topology with switches A, B, C, and D]
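For concreteness, here is a minimal sketch of the ECMP selection described above: hash the flow's 5-tuple and use the result to index the list of equal-cost next hops. The function and field names, and the use of Python's built-in tuple hash, are illustrative assumptions; real switches compute a hardware hash (e.g., a CRC) over the same header fields.

```python
# Illustrative ECMP next-hop selection (not a specific switch's implementation).

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    """Pick one equal-cost next hop by hashing the 5-tuple.

    Every packet of a flow carries the same 5-tuple, so the whole flow sticks
    to a single path (coarse grained), and the choice ignores queue occupancy
    here and elsewhere (stateless and local).
    """
    flow_key = (src_ip, dst_ip, src_port, dst_port, proto)
    return next_hops[hash(flow_key) % len(next_hops)]

# Two large flows can hash onto the same uplink while other uplinks sit idle.
uplinks = [4, 5, 6]
print(ecmp_next_hop("10.0.1.2", "10.2.0.3", 43512, 80, "tcp", uplinks))
```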
Rethinking the problem: Hedera [NSDI'10], Mahout [INFOCOM'11], Fastpass [SIGCOMM'14], Planck [SIGCOMM'14], Presto [SIGCOMM'15], MPTCP [NSDI'11], CONGA [SIGCOMM'14]
Rethinking the problem
Let's move the load balancing functionality out of the core!
Moving LB from the fabric to a central controller: Hedera [NSDI'10], Mahout [INFOCOM'11], Fastpass [SIGCOMM'14], Planck [SIGCOMM'14]
Moving LB from the fabric to the hosts: MPTCP [NSDI'11]
Moving LB from the fabric to the edge: Presto [SIGCOMM'15], CONGA [SIGCOMM'14]
Scalable LB design space
[Chart: LB granularity (flow, elephant flow, flowcell/flowlet, packet, micro bursts) vs. control loop latency (oblivious, O(1s), O(100ms), O(10ms), O(1ms), O(µs)), with ECMP, Hedera, Planck, CONGA, and Presto plotted]
Micro load balancing
Microsecond timescales; microscopic, i.e., local to each switch port.
[Chart: the same design space, highlighting the unoccupied packet-granularity, O(µs) corner]
Micro LB: a plausible architecture
Symmetric topologies.
1. Discover the topology.
2. Compute multiple shortest paths.
3. Install them into the FIB:
dst | ports
A   | 4, 5, 6
B   | 4, 5, 6
C   | 4, 5, 6
~ ECMP, so far...
Inside a switch:
dst | ports
A   | 4, 5, 6, 7
B   | 4, 5, 6, 7
C   | 4, 5, 6
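A minimal sketch of what installing such a FIB could look like in software, under the assumption that the FIB simply maps each destination to its set of equal-cost output ports; the names and data structure are illustrative, not the talk's implementation.

```python
# Illustrative FIB: each destination maps to all output ports that lie on
# a shortest path toward it (computed from the discovered topology).
fib = {
    "A": [4, 5, 6, 7],
    "B": [4, 5, 6, 7],
    "C": [4, 5, 6],
}

def candidate_ports(dst):
    """Equal-cost output ports for a destination.

    Up to this point the state is exactly what ECMP keeps; schemes differ
    only in how one port is picked from this set for each flow or packet.
    """
    return fib[dst]
```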
The power of 2 choices
Random placement: n bins and n balls; each ball chooses a bin independently and uniformly at random.
Max load: Θ(log n / log log n), far from optimal.
Y. Azar, A. Z. Broder, A. R. Karlin, and E. Upfal. Balanced Allocations. SIAM Journal on Computing, 1994.
The power of 2 choices
n bins and n balls; balls placed sequentially, each into the least loaded of d ≥ 2 bins chosen uniformly at random.
Max load: Θ(log log n / log d).
[Plot: max load for d = 1, d = 2, and the optimal allocation]
Y. Azar, A. Z. Broder, A. R. Karlin, and E. Upfal. Balanced Allocations. SIAM Journal on Computing, 1994.
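A quick simulation makes the gap concrete; the code and parameters below are illustrative, not from the talk.

```python
# Balls and bins with the power of d choices.
import random

def max_load(n, d):
    """Throw n balls into n bins; each ball lands in the least loaded of
    d bins sampled uniformly at random.  Return the maximum bin load."""
    bins = [0] * n
    for _ in range(n):
        choices = [random.randrange(n) for _ in range(d)]
        best = min(choices, key=lambda b: bins[b])
        bins[best] += 1
    return max(bins)

n = 100_000
print("d = 1:", max_load(n, 1))   # grows like log n / log log n (typically 7-9 here)
print("d = 2:", max_load(n, 2))   # grows like log log n / log 2 (typically 3-4 here)
```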
What we want: queues instead of bins, with each ball (packet) still choosing independently and with no coordination.
Unlike the classical setting, packets arrive and drain over time, decisions are made concurrently, and a choice that lands on a full queue is dropped.
[Figure: packets arriving at parallel queues; one packet is dropped from a full queue]
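The sketch below illustrates the risk in this setting, under assumed toy parameters: with a single uncoordinated random choice per packet (d = 1), finite queues overflow even though the aggregate drain rate matches the aggregate arrival rate, while sampling two queues keeps lengths balanced.

```python
# Toy queue model: per-packet random choice over finite output queues.
import random

def run(d, queues=4, capacity=8, rounds=2000):
    q = [0] * queues
    drops = 0
    for _ in range(rounds):
        # One packet arrives per queue per round, each picking the least
        # loaded of d randomly sampled queues (d = 1 is a blind choice).
        for _ in range(queues):
            choices = [random.randrange(queues) for _ in range(d)]
            target = min(choices, key=lambda i: q[i])
            if q[target] < capacity:
                q[target] += 1
            else:
                drops += 1          # dropped despite spare room in other queues
        # Each queue drains one packet per round.
        for i in range(queues):
            q[i] = max(0, q[i] - 1)
    return drops

print("d = 1 drops:", run(1))       # typically nonzero
print("d = 2 drops:", run(2))       # typically zero or near zero
```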
Simulation methodology
OMNeT++, INET framework
Linux 2.6 TCP
Leaf/spine topologies
Data center traces from DevoFlow [SIGCOMM'11]
The pitfalls of choice
[Plot: standard deviation of queue lengths for d = 1, d = 2, and d = 40, compared against balls and bins]
Setting the parameters matters: with too many samples, uncoordinated sources herd onto the momentarily shortest queue and stability suffers.
DRILL: Decentralized Randomized In-network Local Load balancing
[Chart: the LB design space again, with DRILL at packet granularity and O(µs) control loop latency]
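A sketch of what a DRILL-style per-packet decision could look like, assuming each switch samples d random candidate output queues and also considers the port it found least loaded on its previous decision (memory of size m); the class, names, and defaults below are illustrative simplifications, not the exact specification from the talk.

```python
# Illustrative DRILL-style engine: per-packet, purely local decisions.
import random

class DrillEngine:
    def __init__(self, d=2, m=1):
        self.d = d            # queues sampled at random per packet
        self.m = m            # remembered least-loaded ports (0 disables memory)
        self.memory = []      # ports found least loaded on recent decisions

    def pick_port(self, candidate_ports, queue_len):
        """Forward to the least loaded of d random samples plus the memory."""
        samples = random.sample(candidate_ports, min(self.d, len(candidate_ports)))
        candidates = set(samples) | {p for p in self.memory if p in candidate_ports}
        best = min(candidates, key=lambda p: queue_len[p])
        if self.m > 0:
            self.memory = [best]          # simplified: remember only the winner
        return best

# Example: destination A is reachable via ports 4-7; queue lengths are read
# locally at the switch, so no control loop leaves the chip.
engine = DrillEngine(d=2, m=1)
print(engine.pick_port([4, 5, 6, 7], {4: 3, 5: 0, 6: 7, 7: 1}))
```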
Substantial improvement over prior work.
[Plot: performance of an incast application]
Thou shalt not split flows!
Splitting flows → reordering → throughput degradation.
Splitting flows → latency variance → reordering → throughput degradation.
DRILL splits flows along many paths, yet it keeps queueing delay variance low and therefore causes little reordering!
[Bar charts: fraction of paths used, queue-length standard deviation, out-of-order packets, and buffering delay (µs) for DRILL, Presto, and per-packet Random]
Insight: queueing delay variance is so small that it doesn't matter what path the packet takes.
Ongoing and future work
Efficient handling of asymmetry: failures, irregular topologies with non-equal-cost paths
Converged Ethernet
Micro Load Balancing with DRILL: Conclusion
Microscopic, microsecond decisions yield the lowest-latency load balancing.
Splitting flows is splitting hairs.
Strong candidate for augmenting data center switching hardware.