CS 557 BGP Convergence Improved BGP Convergence via Ghost Flushing Bremler-Barr, fek, Schwarz, 2003 BGP-RCN: Improving Convergence Through Root Cause Notification Pei, zuma, Massey, Zhang, 2005 Spring 2013
Z s Candidate paths: () (B ) (C () B ) (E () D B ) (H (I H G G F E F ) BGP Path Exploration Z ( ) ( ) ( ) C B F E D ( ) ( ) ( ) dest. I H G n Obsolete paths (C B ) and (E D B ) explored before converging on valid path (I H G F )
Potential to Explore N! Paths Paths Explored by D B C S,C,S Link C,S fails,b,c,s,d,c,s,b,d,c,s,d,b,c,s. No route Theoretically can explore N! paths before no route
Some Routing Terminology Tup = route to previously unreachable prefix is announced. Tdown = route to current reachable prefix is withdrawn and no replacement exists Tshort = route to current reachable prefix switches to shorter path Tlong = route to current reachable prefix switches to a longer path Other terminology Tdown = fail-down Tlong = fail-over
BGP MRI Time and Convergence n Minimum Route dvertisement Interval (MRI) timer: n Within M=30 seconds, at most one announcement from to B n not for the first announcement, not for the withdrawal n Impact: a. suppress transient changes b. delay convergence s path changes: P 1 w P 2 P 3 P 4 P 5 time=0 Msgs from to B: P 1 w time=30 time=60 P 4 P 5
[BS03] Improving BGP Convergence Objective: Improve convergence time after a legitimate route change. pproach: Flush out ghost information that is blocked by the MRI timer P4 in previous slide is ghost information Contributions: Simple, easily deployed, and clever approach to improve convergence Theoretical understanding of convergence behavior Improves on 2002 result from Pei et al.
Basic Model Each S is treated as one node Though not strictly required in ghost flushing Routers use shortest path routing policy Helps with analysis, but not strictly required SPVP (simple path vector protocol) approximates BGP MRI timer between updates Minimum Route dvertisement Interval Two consecutive updates must be at least MRI time apart.
Ghost Information Obsolete path information stored at node Could be preferred route or backup route stored at a node. MRI timer can block removal of ghost information Router cannot announce its current choice of paths because it recently announced a different path. Typical MRI value is 30 seconds Can lead to increased convergence time and increased chance of selecting ghost paths.
Ghost Flushing Very Simple Rule for BGP Routers When route to P is updated to a worse path and MRI timer is delaying path announcement send withdraw(p) (no route to P)
Path Length and Time ssume Tdown Event Let H = message passing time Claim at time K*H, every message or node has SPath length > K By induction. True at time H since neighboring routers received withdraw ssume true at time KH, all paths longer than KH. Suppose K or less path exists at time (K+1)H Must have come from some peer P with path length KH. Path must have been removed prior to time KH Withdraw or longer path announced prior to time KH Must be received prior to time (K+1)H (contradiction)
Implications of Time/Length Shown that at the K*H, every message or node has SPath length > K Implications: Longest possible path has length N t time N*H, all paths are longer than longest possible path By time N*H, all routers know that path is withdrawn Convergence time is (N*H) Reduced from N*MRI
Message Complexity Claim at most 2 messages sent during each MRI timer interval Resulting complexity Number of MRI rounds is NH/(MRI) Updates per round is 2E Complexity is O(2ENH/MRI) (BGP complexity is EN)
Tlong (fail-over) Complexity Expect good results, but no theoretical results presented here Simulations show solid improvement Other simulations (not shown here) show some surprises Theoretical results later determined by Pei et al. Covered next week.
[P+05] Improving BGP Convergence Objective: Improve convergence time after a legitimate route change. pproach: Signal the cause of the path failure Contributions: Dramatic reduction in convergence time plus ability to improve other parts of BGP Theoretical understanding of convergence behavior
BGP Path Exploration Revisited Z s Candidate paths: () (B ) (C () B ) (E () D B ) (H (I H G G F E F ) Z ( ) ( ) ( ) C B F E D ( ) ( ) ( ) dest. I H G n Observation: if Z know [B ] failed, it could ve avoided the obsolete paths
Root Cause Notification The node who detects the failure attaches root cause to msg Other nodes copy the root cause to outgoing messages n the first msg is enough for Z to remove all the obsolete paths Z s Candidate paths: () (B ) (C B ) (E D B ) ) (H (I H G G F E F ) Z ( ), [B ] failure ( ), [B ] failure C E D B ( ), [B ] failure F I H G
Overlapping Events nother topology change happens before the previous change s convergence finishes. [B ] failure Z B E D [B ] failure dest. Propagation along lower path is slower than upper path
Overlapping Events Path: (B ) [B ] recovery Z B E D [B ] recovery dest. [B ] failure
Wrong! Path: (B ) Overlapping Events Observation: need to order the relative timing of the root causes Z B E D dest. [B ] failure [B ] recovery
Solution: adding sequence number Node B maintains a sequence number for link [B ] Incremented each time the link status changes [B ] failure, seqnum=1 Z B E D [B ] failure, seqnum=1 dest.
Solution: adding sequence number (B ), [B ] recovery, seqnum=2 Path: (C B ), seqnum of [B ]=2 Z B E D dest. [B ] failure, seqnum=1
Solution: adding sequence number Sequence number orders the relative timing of the root causes Path: (B ), seqnum of [B ]=2 Z B E [B ] failure, seqnum=1 D dest.
Evaluation: analysis and simulation n Two types of topology changes: dest. n Fail-over: nodes switch to worse paths Z I H B G F n Fail-down: destination becomes unreachable dest.
Z Fail-down convergence delay (worst case) bound Withdrawals are not delayed by MRI! w C h seconds w long shortest path: it takes at most d*h seconds B h seconds d: network diameter w nodal processing delay dest. d << N-1 and h <<M RCN BGP d * h (N-1) * (h+m) Length of the longest possible path(n) MRI value
Fail-down simulation results n 2-3 orders of magnitudes reduction Convergence Time 1000 Seconds 100 10 BGP RCN 1 14 28 56 112 Number of nodes
Border nodes in fail-over convergence unaffected nodes Z s eventual path has always been available H I J D Z ffected nodes B C dest. Border node Z: connected to an unaffected node H its eventual path is through H
RCN s fail-over delay bound H First message unaffected is nodes not delayed by MRI! D Phase 2: (M+h)* d affected ffected nodes Node D s convergence: Z Phase 1: h*d affected B C dest. Phase 1: Z receives the root cause Phase 2: Z s path is propagated to D (MRI delay applies in this phase) diameter of the sub-graph of affected nodes RCN (M + 2*h)*d affected
BGP s fail-over delay bound H unaffected nodes D Phase 2: (M+h)* d affected ffected nodes Node D s convergence: Z B C dest. Phase 1: Z explores paths shorter than Z s eventual path Phase 2: the same as in RCN BGP (M+h) * min{d - J, V affected + d affected -1}
Fail-over simulation results n BGP does fine : <(M+h) * d n d : 2~6 d : length of the longest path from any affected node to the destination Seconds 25 20 15 10 5 0 14 28 56 112 Number of nodes BGP RCN Constructed topologies with large d : RCN has much more pronounced improvement
RCN Overhead n Transmission & storage of a path: doubled path:seqnum (Z C B ):(3 2 2 1) n Storage overhead in the routing table: n doubled n Transmission overhead reduced n 1~2 orders of magnitudes reduction in msg counts
Related Work n Reducing negative impact of MRI: n [Griffin:ICNP01], Ghost-Flushing [Bremler-Barr:Infocom03] n don t deal with path exploration n Reducing path exploration n Consistency ssertion [Pei:Infocom02] n path exploration still exists n Explicitly signaling failure n RCO [Luo:Globecom02], BGP-CT [Wattenhofer:talkslides03]: may result in wrong routing decision n EPIC [Chandrasheka:Infocom05]: encoding difference