The Impact of Router Outages on the AS-Level Internet Matthew Luckie* - University of Waikato Robert Beverly - Naval Postgraduate School *work started while at CAIDA, UC San Diego SIGCOMM 2017, August 24th 2017 www.caida.o 1
Internet Resilience Where are the Single Points of Failure? CE CE CE PE PE PE Example #A Example #B CE: Customer Edge PE: Provider Edge 2
Internet Resilience Where are the Single Points of Failure? CE If the CE router fails, the network is disconnected, so the CE router is a PE Single Point of Failure (SPoF) PE Example #A CE: Customer Edge PE: Provider Edge 3
Internet Resilience Where are the Single Points of Failure? CE CE If the CE router fails, the network has an PE alternate path available, so the CE router is NOT a Single Point of Failure (SPoF) Example #B CE: Customer Edge PE: Provider Edge 4
Internet Resilience Where are the Single Points of Failure? CE CE If the PE router fails, PE the customer network is disconnected, so the PE router is a Single Point of Failure (SPoF) CE: Customer Edge Example #B PE: Provider Edge 5
Challenges in topology analysis Prior approaches analyzed static AS-level and router-level topology graphs, - e.g.: Nature 2000 Important AS-level and router-level topology might be invisible to measurement, such as backup paths, - e.g: INFOCOM 2002 A router that appears to be central to a network s connectivity might not be - e.g.: AMS 2009 6
What we did Large-scale (Internet-wide) longitudinal (2.5 years) measurement study to characterize prevalence of Single Points of Failure (SPoF): 1. Efficiently inferred IPv6 router outage time windows 2. Associated routers with IPv6 BGP prefixes 3. Correlated router outages with BGP control plane 4. Correlated router outages with data plane 5. Validated inferences of SPoF with network operators 7
What we did Identified IPv6 router interfaces from traceroute 83K to 2.4M interfaces from CAIDA s Archipelago traceroute measurements 8
What we did probed router interfaces to infer outage windows We used a single vantage point located at CAIDA, UC San Diego for the duration of this study 9
Central counter: 9290 What we did 10
Central counter: 9290 9291 What we did T1: 9290 9290 10
Central counter: 9290 9291 9292 What we did T1: 9290 T2: 9291 9291 10
Central counter: 9290 9291 9292 9293 What we did T1: 9290 T2: 9291 9292 T3: 9292 10
Central counter: 9290 9291 9292 9293 9294 What we did T1: 9290 T2: 9291 9293 T3: 9292 T4: 9293 10
Central counter: 9290 9291 9292 9293 9294 9295 What we did T1: 9290 T2: 9291 9294 T3: 9292 T4: 9293 T5: 9294 10
Central counter: 9290 9291 9292 9293 9294 9295 1 What we did Reboot! T1: 9290 T2: 9291 T3: 9292 T4: 9293 T5: 9294 10
Central counter: 9290 9291 9292 9293 9294 9295 12 What we did T1: 9290 T2: 9291 1 T3: 9292 T4: 9293 T5: 9294 T6: 1 10
Central counter: 9290 9291 9292 9293 9294 9295 12 3 What we did T1: 9290 T2: 9291 2 T3: 9292 T4: 9293 T5: 9294 T6: 1 T7: 2 10
Central counter: 9290 9291 9292 9293 9294 9295 12 34 What we did T1: 9290 T2: 9291 3 T3: 9292 T4: 9293 T5: 9294 T6: 1 T7: 2 T8: 3 10
What we did probed router interfaces to infer outage windows using IPID T1: 9290 T2: 9291 T3: 9292 T4: 9293 T5: 9294 T6: 1 Outage Window T7: 2 T8: 3 Infer a reboot when time series of values returned from a router is discontinuous, indicating router was restarted 11
Why IPv6 fragment IDs? IPv4 Fragment IDs: - 16 bits, bursty velocity: every packet requires unique ID - At 100Mbps and 1500 byte packets, Nyquist rate dictates 4 second probing interval IPv6 Fragment IDs: - 32 bits, low velocity: IPv6 routers rarely send fragments - We average 15 minute probing interval 12
What we did correlated routers with prefixes using traceroute paths 13
2001:db8:2::/48 What we did correlated routers with prefixes Ark VP using traceroute paths 2001:db8:1::/48 50-60 Ark VPs traceroute every routed IPv6 prefix every day Ark VP 14
2001:db8:2::/48 What we did correlated routers with prefixes Ark VP using traceroute paths 2001:db8:1::/48 50-60 Ark VPs traceroute every routed IPv6 prefix every day Ark VP 14
2001:db8:2::/48 What we did computed distance of Ark VP router from AS announcing network 0 (CE) 2 1 (PE) 2001:db8:1::/48 CE: Customer Edge PE: Provider Edge 15
2001:db8:2::/48 What we did correlated router outage windows with BGP control plane 0 (CE) 2001:db8:1::/48 16
2001:db8:2::/48 What we did correlated router outage windows with BGP control plane T1: 9290 T2: 9291 T3: 9292 T4: 9293 Outage Window T5: 9294 T6: 1 T7: 2 T8: 3 2001:db8:1::/48 17
2001:db8:2::/48 What we did correlated router outage windows with BGP control plane RouteViews Outage Window T1: 9290 T2: 9291 T3: 9292 T4: 9293 T5: 9294 T6: 1 T7: 2 2001:db8:2::/48 T5.2: Peer-1 W T5.2: Peer-2 W T5.3: Peer-3 W T5.3: Peer-4 W T5.8: Peer-3 A T5.8: Peer-2 A T8: 3 2001:db8:1::/48 T5.8: Peer-1 A T5.8: Peer-4 A 18
What we did classified impact on BGP according to observed activity overlapping with inferred outage Complete Withdrawal: all peers simultaneously withdrew route for at least 70 seconds - Single Point of Failure (SPoF) Partial Withdrawal: at least one peer withdrew route for at least 70 seconds, but not all did Churn: BGP activity for the prefix No Impact: No observed BGP activity for the prefix 19
What we did Data Collection Summary Probed IPv6 routers at ~15 minute intervals from 18 Jan 2015 to 30 May 2017 (approx. 2.5 years) 149,560 routers allowed reboots to be detected We inferred 59,175 (40%) rebooted at least once,750k reboots in total 1 0.8 CDF 0.6 0.4 0.2 0 1 10 100 Number of Outages 20
What we found 2,385 (4%) of routers that rebooted (59K) we inferred to be SPoF for at least one IPv6 prefix in BGP Of SPoF routers, we inferred 59% to be customer edge router; 8% provider edge; 29% within destination AS No covering prefix for 70% of withdrawn prefixes - During one-week sample, covering prefix presence during withdrawal did not imply data plane reachability IPv6 Router reboots correlated with IPv4 BGP control plane activity 21
Limitations Applicability to IPv4 depends on router being dual-stack Requires IPID assigned from a counter - Cisco, Huawei, Vyatta, Microtik, HP assign from counter - 27.1% responsive for 14 days assigned from counter Router outage might end before all peers withdraw route - Path exploration + Minimum Route Advertisement Interval (MRAI) + Route Flap Dampening (RFD) Complex events: multiple router outages but one detected - We observed some complex events and filtered them out 22
Validation Reboots SPoF Network?? US University 7 0 8 7 0 8 US R&E backbone #1 2 0 3 3 2 0 US R&E backbone #2 3 0 1 0 0 4 NZ R&E backbone 11 0 22 4 2 27 Total: 23 0 34 14 4 39 = Validated Inference = Incorrect Inference? = Not Validated 23
Validation Reboots SPoF Network?? US University 7 0 8 7 0 8 US R&E backbone #1 2 0 3 3 2 0 US R&E backbone #2 3 0 1 0 0 4 NZ R&E backbone 11 0 22 4 2 27 Total: 23 0 34 14 4 39 Challenging to get validation data: operators often could only tell us about the last reboot 24
Validation Reboots SPoF Network?? US University 7 0 8 7 0 8 US R&E backbone #1 2 0 3 3 2 0 US R&E backbone #2 3 0 1 0 0 4 NZ R&E backbone 11 0 22 4 2 27 Total: 23 0 34 14 4 39 No falsely inferred reboots: we correctly observed the last known reboot of each router 25
Validation Reboots SPoF Network?? US University 7 0 8 7 0 8 US R&E backbone #1 2 0 3 3 2 0 US R&E backbone #2 3 0 1 0 0 4 NZ R&E backbone 11 0 22 4 2 27 Total: 23 0 34 14 4 39 We did not detect some SPoFs 26
Number of Interfaces 3M 1M 100K 30K Data Collection Summary 10K Jan 15 23.5K All Incrementing Jul 15 (a) 83K Jan 16 46.5K Jul 16 ~1.1M (b) 79.8K (c) 41.8K 15.2K Jan 17 PPS List Unresponsive (a) 100 Static 83K 12-24 hours (b) 225 Static 1.1M 12-24 hours (c) 200 Dynamic, ~2.4M 7-14 days 27
Correlating BGP/router outages Control: six hours prior to inferred outages, Feb 2015 Fraction of Reboot/Prefix Pairs 0.5 0.4 0.3 0.2 0.1 Churn Partial Withdrawal Complete Withdrawal Outside Dest. AS Inside Dest. AS 0 12 11 10 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 Distance of Router from Destination AS (IP hops) 28
Correlating BGP/router outages During the inferred outages, Feb 2015 Fraction of Reboot/Prefix Pairs 0.5 0.4 0.3 0.2 0.1 Churn Partial Withdrawal Complete Withdrawal Outside Dest. AS Inside Dest. AS 0 12 11 10 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 Distance of Router from Destination AS (IP hops) 29
BGP Prefix Withdrawals: SPoF 1 CDF 0.8 0.6 0.4 0.2 min max 0 1 min 5 15 30 1 2 4 min min min hr hr hr Complete Withdrawal Duration 8 hr 16 hr 44% less than 5 minutes, suggestive of router maintenance or router crash 30
SPoF prefixes mostly single homed Especially Router hop distance 0 0.2 Fraction of Population 0.4 0.6 0.8 1 SPoFs outside 3 destination AS, 2 as expected PE 1 CE 0 1 2 3 Prefix announced through a single upstream Prefix announced through multiple upstreams 31
Impact on IPv4 prefixes in BGP 1 Cumulative Fraction Router Outages 0.9 0.8 0.7 0.6 0.5 0.4 Before Outage During Outage 0 0.2 0.4 0.6 0.8 1 Control Outage Withdrawn Peers/Advertising Peers We examined IPv4 prefixes for 5% sample of reboots. 19% of correlated IPv4 prefixes withdrawn by at least 90% of peers during router outage window. 32
Summary Step towards root-cause analysis of inter-domain routing outages and events Fraction of Reboot/Prefix Pairs 0.5 Churn Partial Withdrawal Complete Withdrawal 0.4 0.3 0.2 0.1 0 12 11 10 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 Distance of Router from Destination AS (IP hops) - Explore applicability of method to measurement of other critical Internet infrastructure: DNS, Web, Email In our 2.5 year sample of 59K routers that rebooted - 4% (2.3K) were SPoF - SPoF were mostly confined to the edge: 59% customer edge We released our code as part of scamper https://www.caida.org/tools/measurement/scamper/ 33
Backup Slides 34
Impact on IPv4 Services censys.io April 2017 Active Hosts 39,107 HTTP 25,592 HTTPS 16,321 } Web SSH 11,277 DNS 7,922 IMAP 5,127 } Email SMTP 7,383 We examined IPv4 prefixes for 5% sample of reboots where at least 90% of peers during router outage window. 35
Partial Withdrawals CDF 1 0.8 0.6 0.4 0.2 0 50% of pairs had 1 2 peers withdraw 10% of pairs had nearly all peers withdraw 0 0.2 0.4 0.6 0.8 1 Fraction of Peers Withdrawing Route 50% of pairs had 1-2 peers withdraw prefix 10% of pairs had nearly all peers withdraw prefix 36
Degrees of ASes monitored Cumulative Fraction of Rebooting ASes 1 0.8 0.6 0.4 0.2 0 Single Points of Failure Monitored Population 1 10 100 1000 AS Degree ASes that were inferred to have a SPoF were disproportionately low-degree ASes 37
Activity for IPv4 prefixes in BGP Cumulative Fraction Router Outages 1 0.8 0.6 0.4 0.2 0 Before Outage During Outage 0 0.2 0.4 0.6 0.8 1 Peers Sending Updates/Total Peers At least 70% of peers reported BGP activity on IPv4 prefixes for 50% of the inferred router outages 38
Reboot Window Durations CDF 0.9 1 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 1 min 5 min min max 15 min 30 min 1 hr 2 hr Reboot Window Duration 4 hr 8 hr 16 hr Half the maximum reboot lengths were less than 30 minutes (~two probing rounds) 39
Router + BGP outage correlation Router IP-ID Sequence: 10, 11, 12 1, 2, 3 Outage Window BGP Sequence: Withdraw-Contained W A Outage-Contained W A Withdraw-Before W A Announce-After W A 40
Data processing pipeline Uptime Prober rtr targets CAIDA IPv6 Topology <ip,time,ipid> Cassandra Inferred Reboots BGP Correlation AS border distance single points of failure Route Views <peer,time,prefix> 41
Inferring router position Provider Edge (PE) Router Customer Edge (CE) Router AS X AS Y x 1 R 1 x 2 R 2 x 3 R 3 y 1 R 4 y 2 R 5 2 1 0-1 -2 (a) interface addresses routed by Y appear in traceroute x 1 AS X AS Y x x 2 3?? R R 1 2 R R R 3 4 5 2 1 0 (b) no interface addresses routed by Y appear in traceroute 42
Data Collection Summary 18 Jan 15 18 Oct 16 (a) 18 Oct 16 24 Feb 17 (b) 24 Feb 17 30 May 17 (c) Probing rate 100 pps 225 pps 200 pps Interfaces 83K seen Dec 14 1.1M seen Jun to Oct 16 Dynamic. 2.4M in May 17 Responsive every round ~15 mins every round ~15 mins every round ~15 mins Unresponsive 12-24 hours 12-24 hours 7-14 days 43
Why IPv6 fragment IDs? IPv4 ID values are 16 bits with bursty velocity as every packet requires a unique value. Ver HL DSCP length identification offset TTL protocol checksum source address destination address At 100Mbps and 1500 byte packets. Nyquist rate dictates a 4 second probing interval 44
Why IPv6 fragment IDs? IPv6 ID values are 32 bits with low velocity as systems rarely send fragmented packets. Ver DSCP flow id payload length protocol TTL source address destination address protocol reserved offset identification 45
Soliciting IPv6 Fragment IDs echo request, 1300 bytes echo reply, 1300 bytes packet too big, MTU 1280 echo request, 1300 bytes echo reply, 1280 bytes Fragment ID: 12345 46