Traffic Engineering with Forward Fault Correction
Harry Liu, Microsoft Research, 06/02/2016
Joint work with Ratul Mahajan, Srikanth Kandula, Ming Zhang and David Gelernter
Cloud services require large network capacity
- Cloud applications generate growing traffic.
- Cloud networks are expensive (e.g., cost of a WAN: $100M/year).
TE is critical to effectively utilizing networks
Traffic engineering (centralized & SDN-based):
- WAN: Microsoft SWAN (SIGCOMM '13), Google B4 (SIGCOMM '13)
- Datacenter: DevoFlow (SIGCOMM '11), MicroTE (CoNEXT '11)
Centralized TE is the key to network efficiency
- Local view & control: each node decides 1) how much traffic to admit and 2) how to route, yielding sub-optimal resource allocation. [Figure: link capacity 10, demands of 10, paths limited to 2 hops; total throughput 15.]
- Centralized TE controller: optimal resource allocation based on a global view & control. [Figure: same topology; total throughput 20.]
But centralized TE is also vulnerable to faults
- The TE controller collects a network view (e.g., topology, capacities, traffic) and pushes network configuration (e.g., routes, rate limits).
- Updates are frequent to keep utilization high (e.g., every 5 minutes).
- Faults occur in both the data plane and the control plane.
Data-plane faults
- Link and switch failures.
- Rescaling: after a failure, a flow's traffic is redistributed proportionally onto its residual paths.
- [Figure: link capacity 10; a flow of size 10 is split 3/7 across two paths. When a link fails, rescaling moves its traffic onto surviving links that are already loaded, causing congestion.]
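The rescaling behavior above can be sketched in a few lines. The topology and numbers here are hypothetical, chosen only to mirror the slide's point that proportional rescaling after a failure can overload a surviving link.

```python
# Illustrative sketch of proportional rescaling after a link failure.
# Paths, allocations, and capacities are hypothetical.

def rescale(flow_size, alloc, failed):
    """Redistribute flow_size proportionally over the surviving paths."""
    surviving = {p: a for p, a in alloc.items() if p not in failed}
    total = sum(surviving.values())
    return {p: flow_size * a / total for p, a in surviving.items()}

# Flow f1 (10 units) split 7/3 over two link-disjoint paths P1 and P2.
alloc_f1 = {"P1": 7.0, "P2": 3.0}
# Another flow already sends 3 units over P1's bottleneck link (capacity 10).
background_on_P1 = 3.0
capacity = 10.0

# Before the fault: P1's link carries 7 + 3 = 10, exactly at capacity.
assert alloc_f1["P1"] + background_on_P1 <= capacity

# P2's link fails: all 10 units of f1 rescale onto P1.
after = rescale(10.0, alloc_f1, failed={"P2"})
load_P1 = after["P1"] + background_on_P1   # 10 + 3 = 13
print(load_P1 > capacity)                  # congestion
```
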
Control-plane faults
- Failures or long delays in applying TE configurations to a network device: RPC failures, firmware bugs, overloaded CPUs, memory shortage.
- Control-plane faults can also result in congestion, since some switches apply the new configuration while others keep the old one.
TE's control over the network is undermined by faults
- Data-plane faults make the controller's network view inaccurate.
- Control-plane faults make the pushed network configuration incomplete.
Control- and data-plane faults in practice
In a production WAN (200+ routers, 6,000+ links):
- Faults are common: data plane, fault rate of 25% per 5 minutes; control plane, fault rate of 0.1%-1% per TE update.
- Faults cause severe congestion.
State of the art for handling faults
- Heavy over-provisioning: big loss in throughput.
- Reactive handling: retry on control-plane faults; re-compute TE and update the network on data-plane faults. But reaction cannot prevent congestion, is slow (seconds to minutes), and can itself be blocked by control-plane faults.
How about handling faults proactively?
The network itself is not robust enough, so make the TE algorithm robust.
Forward fault correction (FFC) in TE
- Bad news: individual faults are unpredictable. Good news: the number of simultaneous faults is small.
- Analogy with FEC: FEC guarantees no information loss under up to k arbitrary packet drops, via careful data encoding. FFC guarantees no congestion under up to k arbitrary network faults, via careful traffic distribution.
Example: FFC for link failures
[Figure: link capacity 10; with FFC for k=1, each of two flows is granted 9 units and spreads it 8/1 across its paths. In each single-link-failure case, the rescaled traffic (at most 9 units per link) still fits within every link's capacity, so no congestion occurs.]
Trade-off: network efficiency vs. robustness
- There exists a trade-off between throughput and robustness: in one example, non-FFC throughput is 30 while FFC (k=1) achieves 18.
- But FFC does not always sacrifice efficiency for robustness: in another example, non-FFC and FFC (k=1) both achieve throughput 18.
- Goal: achieve the optimal throughput subject to the FFC guarantee.
Systematically realizing FFC in TE
- Formulation: how to merge FFC into the existing TE framework?
- Computation: how to find FFC-TE efficiently?
Basic TE linear-programming formulation
- TE decisions: flow sizes b_f and per-tunnel traffic l_{f,t}.
- TE objective: maximize throughput: max Σ_f b_f.
- Basic TE constraints:
  - Deliver all granted flows: ∀f: Σ_t l_{f,t} ≥ b_f.
  - No overloaded link: ∀e: Σ_f Σ_{t ∋ e} l_{f,t} ≤ c_e.
- FFC constraints: no overloaded link under up to k_c control-plane faults, k_e link failures, and k_v switch failures.
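A minimal sketch of these basic TE constraints in code. The 2-flow instance is hypothetical, and the code only checks a candidate allocation; a real TE system would hand these constraints to an LP solver.

```python
# Encode the basic TE constraints and evaluate a candidate allocation.
# Instance is hypothetical; in practice b_f and l_{f,t} come from an LP solver.

capacities = {"e1": 10.0, "e2": 10.0}   # c_e for each link
tunnels = {                             # which links each (flow, tunnel) uses
    ("f1", "t1"): ["e1"],
    ("f1", "t2"): ["e2"],
    ("f2", "t1"): ["e1"],
}

def feasible(b, l):
    # Deliver all granted flows: for each f, sum_t l[f,t] >= b[f].
    for f, size in b.items():
        if sum(v for (ff, _), v in l.items() if ff == f) < size:
            return False
    # No overloaded link: for each e, traffic of tunnels crossing e <= c_e.
    for e, cap in capacities.items():
        if sum(v for key, v in l.items() if e in tunnels[key]) > cap:
            return False
    return True

def throughput(b):
    return sum(b.values())   # TE objective: max sum_f b_f

b = {"f1": 12.0, "f2": 5.0}
l = {("f1", "t1"): 5.0, ("f1", "t2"): 10.0, ("f2", "t1"): 5.0}
print(feasible(b, l), throughput(b))   # True 17.0
```
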
Formulating data-plane FFC
- A flow of size s_f from S to D is spread over three link-disjoint paths with bandwidth allocations a_1, a_2, a_3.
- FFC with k=1 requires:
  - fault on path 1: s_f ≤ a_2 + a_3
  - fault on path 2: s_f ≤ a_1 + a_3
  - fault on path 3: s_f ≤ a_1 + a_2
- Lemma: FFC is achieved when path i's weight is a_i / (a_1 + a_2 + a_3).
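The per-flow FFC condition and the lemma can be checked directly. The flow size and allocations below are hypothetical numbers satisfying the k=1 condition.

```python
from itertools import combinations

def ffc_ok(s_f, alloc, k):
    """FFC for one flow over link-disjoint paths: every (n-k)-subset of
    allocations must sum to at least the flow size."""
    n = len(alloc)
    return all(sum(c) >= s_f for c in combinations(alloc, n - k))

def loads_after_fault(s_f, alloc, failed):
    """Per the lemma, surviving path i carries s_f * a_i / (sum of surviving a_j)."""
    surviving = [a for i, a in enumerate(alloc) if i not in failed]
    total = sum(surviving)
    return [s_f * a / total for a in surviving]

# Hypothetical flow: size 5 over three link-disjoint paths.
a = [4.0, 3.0, 2.0]
assert ffc_ok(5.0, a, k=1)   # every pair of allocations sums to >= 5

# Fail any single path: the rescaled load on each survivor stays within
# its allocation, so no link carries more than was reserved for it.
for j in range(3):
    surv_alloc = [x for i, x in enumerate(a) if i != j]
    loads = loads_after_fault(5.0, a, failed={j})
    assert all(load <= cap + 1e-9 for load, cap in zip(loads, surv_alloc))
```

This is exactly why the k-sum condition suffices: if s_f is at most the sum of the surviving allocations, proportional rescaling keeps each path at or below its reserved bandwidth.
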
An efficient and precise solution to FFC
- k-sum linear constraint group (k-sum group): given n paths and A = {a_1, a_2, ..., a_n}, FFC requires that the sum of arbitrary n-k elements of A is ≥ the flow size: O(C(n,k)) constraints.
- The FFC-TE LP formulation combines the TE objective, the basic TE constraints, and one k-sum group per flow; naively, this yields too many FFC constraints.
- Lossless compression of a k-sum group: from O(C(n,k)) constraints to O(kn) via a bubble-sorting network (SIGCOMM 2014), and to O(n) via LP strong duality (MSR TR 2016). http://www.hongqiangliu.com/publications.html
FFC extensions
- Differential protection for different traffic priorities
- Minimizing congestion risks without rate limiters
- Control-plane faults on rate limiters
- Uncertainty in current TE
- Different TE objectives (e.g., max-min fairness)
Implementation & evaluation highlights
- Testbed experiment (8 switches, 30 servers): FFC can be implemented on commodity switches; FFC has no data loss due to congestion under faults.
- Large-scale simulation: a WAN with O(100) switches and O(1000) links, a one-week traffic trace, and fault injection according to a real failure trace.
- Results: with negligible throughput loss, FFC reduces data loss by a factor of 7-130 in well-provisioned networks, and reduces the data loss of high-priority traffic to almost zero in well-utilized networks.
Conclusion and future work
- FFC sits with the SDN controller: the controller takes the network view and emits the network configuration, and FFC makes that configuration robust to faults.
- Network properties protected: high throughput, no congestion; future work: security, availability, connectivity.
- Network faults handled: data-plane and control-plane faults; future work: misconfigurations, attacks, traffic spikes.
Q&A