BGP Convergence in much less than a second
Clarence Filsfils - cf@cisco.com
Presented by Martin Winter - mwinter@cisco.com
Down Convergence
(Diagram: Src - R - R - Dst over link L; default metric = 1, alternate path metric 20; timeline T1 ... T2)
Assume a flow from Src to Dst:
- T1: when L dies, the best path is impacted and traffic is lost
- T2: when the network converges, a next best path is computed and traffic reaches the destination again
Loss of Connectivity = T2 - T1, called Down convergence hereafter.
Analyzed for streams going to IGP- and BGP-learned prefixes.
Up Convergence
(Diagram: same topology; timeline T1 ... T2)
Assume a flow from Src to Dst:
- T1: when L comes up, a better best path becomes available; there is no traffic loss
- T2: the new bestpath is applied; there is still no traffic loss
T2 - T1 is called Up convergence. It does not imply any loss of connectivity.
Convergence: Down versus Up
- Down convergence impacts customer service: there is loss between T1 and T2.
- Up convergence does not impact customer service: there is no loss between T1 and T2, because modern router implementations support a lossless switch-over from one valid path to another.
This paper focuses on BGP convergence upon a Down transition.
BGP PIC Core
BGP Prefix-Independent Convergence: Core failure
(Diagram: PE1 reaches PE2 via PoS1 or PoS2)
PE1's RIB: 10.1.1/24 via PE2, 10.1.2/24 via PE2, ..., 99.99.99/24 via PE2
Upon a core failure, the IGP finds an alternate path to the BGP next-hop PE2 in a few 100s of msec.
Requirement: the BGP prefixes depending on reachability to PE2 must leverage the new ISIS path as soon as it is updated in the FIB.
The right behavior is BGP PIC Core
(Figure: Agilent measurements from test iox-crs1-p-isis-bgpipv4-dpi2i-021307_154358-l-nlb; traffic loss (msec) vs prefix nr. Curves: measured convergence to the BGP nhop, and to the first, middle (175k) and last (350k) BGP dependents of this BGP nhop.)
Prefixes 1 to 5000 are ISIS prefixes. The BGP nhop is an important ISIS prefix and hence is prioritized to the very start of the convergence.
Testbed: Tier-1 ISP topology, CRS-1, IOX 3.5, 5000 ISIS prefixes, 350k IPv4 BGP dependents to the impacted BGP nhop; same behavior for VPNv4 as of 3.5.
When ISIS converges, all the BGP dependents immediately leverage the ISIS convergence: the BGP PIC effect.
The right architecture: hierarchical FIB
(Diagram: BGP net entries -> shared BGP LDI with BGP next-hop(s) -> IGP LDI with IGP next-hop(s) -> output interface)
Pointer indirection between BGP and IGP entries allows for immediate leveraging of the IGP convergence.
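The pointer indirection above can be sketched as a minimal Python model. The class names (IgpPathList, BgpEntry) are illustrative, not the actual platform data structures: the point is only that every BGP prefix shares one IGP object, so one IGP update converges them all.

```python
# Minimal sketch of a hierarchical FIB (illustrative names, not the
# real platform structures).

class IgpPathList:
    """Shared IGP load-info: resolved next-hop and output interface."""
    def __init__(self, nexthops):
        self.nexthops = nexthops           # e.g. [("PE2", "PoS1")]

class BgpEntry:
    """A BGP prefix resolving through a pointer to the shared IGP LDI."""
    def __init__(self, prefix, igp):
        self.prefix = prefix
        self.igp = igp                     # indirection: no private oif copy

# 350k BGP prefixes share ONE IGP path-list for next-hop PE2.
pe2 = IgpPathList([("PE2", "PoS1")])
fib = [BgpEntry("10.0.%d.0/24" % (i % 256), pe2) for i in range(350_000)]

# Core failure: IGP convergence rewrites the single shared object...
pe2.nexthops = [("PE2", "PoS2")]

# ...and every BGP dependent immediately forwards via the new path.
assert all(e.igp.nexthops == [("PE2", "PoS2")] for e in fib)
```

The fixup cost is independent of the number of BGP prefixes, which is exactly the "prefix-independent" property measured on the previous slide.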
The suboptimal way: flattened FIB
(Diagram: each BGP net entry carries its own oif)
The control plane flattens the recursion such that every BGP FIB entry has its own local oif information.
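For contrast, the flattened design can be sketched the same way (again with illustrative names, not an actual vendor FIB): each entry owns a private copy of the resolved output interface, so an IGP change forces a walk over every BGP prefix.

```python
# Sketch of a flattened FIB: every BGP entry carries its own
# private, pre-resolved oif (illustrative structures).

class FlatBgpEntry:
    def __init__(self, prefix, oif):
        self.prefix = prefix
        self.oif = oif                     # private, flattened oif copy

fib = [FlatBgpEntry("10.0.%d.0/24" % (i % 256), "PoS1")
       for i in range(350_000)]

# An IGP change now forces the control plane to re-flatten every
# single BGP prefix: O(number of BGP prefixes), not O(1).
for entry in fib:
    entry.oif = "PoS2"
```

This per-prefix re-flattening is the source of the CPU and convergence-time penalty listed on the next slide.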
Hierarchical FIB - Advantages
- Routing convergence: BGP PIC Core - the BGP dependents converge at the IGP convergence of their nhop
- Scaling and robustness: smaller FIB memory (thanks to sharing); much less CPU requirement (no need to re-flatten all BGP prefixes upon an IGP change)
- Commercially available and proven
BGP PIC Edge
BGP PIC Edge
(Diagram: R3 -ibgp- R1 -ebgp- B1; R3 -ibgp- R2 -ebgp- B2; Z prefixes, e.g. N/n)
Two ISPs peer at multiple locations and exchange 350k BGP prefixes. R3, a typical PE within the grey ISP, installs each of these 350k BGP prefixes as a multipath pair entry (BGP nhop B1, BGP nhop B2).
We send traffic from R3 to each of the 350k IPv4 BGP prefixes of the yellow ISP and measure the loss of connectivity upon a peering node failure (R1).
The UUT (R3) is a 12k running IOX 3.3.
Results: 180 msec!
(Figure: Agilent measurements from test 350-crs1-mp-ipv4-350k-pic-edge; LoC (ms) vs prefix nr, flat at ~180 ms across all 350k prefixes.)
In 180 msec, as a consequence of its IGP convergence, R3 has updated its FIB tree such that all 350k IPv4 BGP prefixes are forwarded via the alternate path through R2.
The normal BGP convergence then takes place, involving R2 sending 350k withdraws and R3 running 350k bestpath computations and RIB modifications. This process will likely take multiple tens of seconds but has no impact on the service, thanks to the FIB fixup provided by BGP PIC at IGP convergence time.
Without BGP PIC Edge
(Diagram: R3 - R1 - B1; R3 - R2 - B2; Z prefixes, e.g. N/n)
R3 receives an NHT/down and triggers, on a per-prefix basis:
- bestpath (e.g. R2 is now best as R1 is gone)
- RIB/FIB update
- peer update
Results for Z = 100,000 IPv4 BGP prefixes:
- 12k/Eng3/32S: 30 seconds
- Other vendors: 30 seconds as well
It is impossible to perform well here, due to the prefix dependency.
Another advantage of hierarchical FIB
(Diagram: BGP net entries -> shared multipath BGP LDI -> two IGP LDIs/oifs)
Pointer indirection between BGP and IGP entries allows for immediate update of the multipath BGP LDI at IGP convergence.
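The Edge case can be sketched the same way as the Core case: here the shared object is the multipath BGP load-info itself. The names below (BgpLoadInfo) are again illustrative, not the actual IOS XR structures.

```python
# Sketch of BGP PIC Edge: all prefixes point at one shared multipath
# BGP LDI (illustrative structures).

class BgpLoadInfo:
    """Shared multipath BGP load-info listing the BGP next-hops."""
    def __init__(self, nhops):
        self.nhops = list(nhops)          # e.g. ["B1", "B2"]

    def delete_nhop(self, nh):
        self.nhops.remove(nh)             # single fixup at IGP convergence

# Z prefixes all resolve through the same LDI (pointer indirection).
ldi = BgpLoadInfo(["B1", "B2"])
rib = {"192.0.2.%d/32" % (i % 256): ldi for i in range(1000)}

# The IGP deletes next-hop B1 (peering node R1 failed): one update...
ldi.delete_nhop("B1")

# ...and every dependent prefix instantly resolves via B2, long before
# the per-prefix BGP withdraws arrive.
assert all(l.nhops == ["B2"] for l in rib.values())
```

One deletion from the shared LDI is what turns the multi-ten-second per-prefix BGP convergence into the ~180 msec FIB fixup measured above.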
Assumptions
The failure must result in an IGP-based deletion of the BGP nhop:
- PE node failure: its IGP neighbors detect the loss of adjacency and trigger a convergence at all remote PEs, which concludes with the deletion of this node from the topology and hence the deletion of any BGP nhop on that node (its /32) or on peering links behind that node.
- Peering link failure: next-hop-self should not be used (the BGP nhop is on the peering link) and the IGP should run passive on the peering link. Upon peering link failure, the IGP converges and concludes with the deletion of the BGP nhop.
The BGP prefixes impacted by the failure must have alternate paths installed as multipath BGP entries.
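The peering-link assumption could look roughly like this in an IOS-style configuration. This is a hedged sketch: the interface name and AS number are made up, and the exact syntax varies by platform and release.

```
! Hypothetical IOS-style sketch of the peering-link assumption
router isis
 passive-interface POS1/0
 ! the peering subnet is advertised, but no adjacency runs on it,
 ! so a link failure leads to an IGP deletion of the BGP nhop
!
router bgp 65000
 neighbor 192.0.2.1 remote-as 65000
 ! note: no "next-hop-self" towards the iBGP peers, so the BGP nhop
 ! stays on the peering link rather than on the PE loopback
```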
Conclusion
Hierarchical FIB allows for:
- better FIB memory scaling
- much lower CPU consumption
- higher robustness
and especially:
- BGP PIC Core
- BGP PIC Edge
Availability
BGP PIC Core:
- 12k: IOX 3.3 (IPv4, VPNv4, IPv6)
- CRS: IOX 3.5 (IPv4, VPNv4, IPv6)
BGP PIC Edge (multipath assumption):
- E3/E5: IPv4 & IPv6 = IOX 3.3; VPNv4 = IOX 3.5
- CRS: IPv4 & IPv6 & VPNv4 = IOX 3.5