The Impact of Router Outages on the AS-Level Internet

Similar documents
Measuring and Characterizing IPv6 Router Availability

Primitives for Active Internet Topology Mapping: Toward High-Frequency Characterization

Updates and Case Study

Analysis of Country-wide Internet Outages Caused by Censorship


Measured Impact of Tracing Straight. Matthew Luckie, David Murrell WAND Network Research Group Department of Computer Science University of Waikato

Software Systems for Surveying Spoofing Susceptibility

IPv6 Topology Mapping

Updates and Analyses

Software Systems for Surveying Spoofing Susceptibility

Internet Topology Research

Inter-Domain Routing Trends

BGP Path Exploration Damping (PED)

On the Prevalence and Characteristics of MPLS Deployments in the Open Internet

Accurate Real-time Identification of IP Hijacking. Presented by Jacky Mak

Internet Measurement Huaiyu Zhu, Rim Kaddah CS538 Fall 2011

Implementing MPLS Forwarding

Lecture 18 Overview. Last Lecture. This Lecture. Next Lecture. Internet Protocol (1) Internet Protocol (2)

Table of Contents 1 System Maintaining and Debugging Commands 1-1

Network Layer (4): ICMP

MPLS VPN Explicit Null Label Support with BGP. BGP IPv4 Label Session

AS Connectedness Based on Multiple Vantage Points and the Resulting Topologies

INFERRING PERSISTENT INTERDOMAIN CONGESTION

Measuring BGP. Geoff Huston. CAIA SEMINAR 31 May

Measurement: Techniques, Strategies, and Pitfalls. David Andersen CMU

Chapter 4 Network Layer: The Data Plane. Part A. Computer Networking: A Top Down Approach

Dynamics of Hot-Potato Routing in IP Networks

Lecture 19: Network Layer Routing in the Internet

TSIN02 - Internetworking

TSIN02 - Internetworking

Routing the Internet in Geoff Huston APNIC March 2007

Introduction to IP Routing. Geoff Huston

Organization of Product Documentation... xi

Revealing MPLS Tunnels Obscured from Traceroute

A Characterization of IPv6 Network Security Policy

ELEC / COMP 177 Fall Some slides from Kurose and Ross, Computer Networking, 5 th Edition

IRNC-SP: Sustainable data-handling and analysis methodologies for the IRNC networks

Achieving scale: Large scale active measurements from PlanetLab

Table of Contents 1 System Maintenance and Debugging Commands 1-1

CS 3516: Advanced Computer Networks

Network Layer PREPARED BY AHMED ABDEL-RAOUF

ECE 158A: Lecture 7. Fall 2015

Telecom Systems Chae Y. Lee. Contents. Overview. Issues. Addressing ARP. Adapting Datagram Size Notes

Introduction to MPLS APNIC

Introduction to Information Science and Technology 2017 Networking II. Sören Schwertfeger 师泽仁

A Technique for Reducing BGP Update Announcements through Path Exploration Damping

Lecture 5 The Network Layer part II. Antonio Cianfrani DIET Department Networking Group netlab.uniroma1.it

Border Gateway Protocol - BGP

TDTS04 Computer networks and distributed systems Final Exam: 14:00-18:00, Thursday, March 20, 2014

Introduction to MPLS. What is MPLS? 1/23/17. APNIC Technical Workshop January 23 to 25, NZNOG2017, Tauranga, New Zealand. [201609] Revision:

MAPPING INTERNET INTERDOMAIN CONGESTION

Internet Mapping Primitives

Topics for This Week

A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance

Network Layer: Control/data plane, addressing, routers

Networking: Network layer

High-frequency mapping of the IPv6 Internet using Yarrp

Introduction to IPv6. Unit -2. Prepared By:- NITIN PANDYA Assistant Professor, SVBIT.

This document is not restricted to specific software and hardware versions.

Tracing the Path to YouTube -

ELEC / COMP 177 Fall Some slides from Kurose and Ross, Computer Networking, 5 th Edition

IPv6 Switching: Provider Edge Router over MPLS

IP - The Internet Protocol. Based on the slides of Dr. Jorg Liebeherr, University of Virginia

Computer Networks CS 552

Computer Networks CS 552

network security s642 computer security adam everspaugh

Initial motivation: 32-bit address space soon to be completely allocated. Additional motivation:

Preventing the unnecessary propagation of BGP withdraws

IP Spoofer Project. Observations on four-years of data. Rob Beverly, Arthur Berger, Young Hyun.

Lecture 3. The Network Layer (cont d) Network Layer 1-1

Examination. ANSWERS IP routning på Internet och andra sammansatta nät, DD2491 IP routing in the Internet and other complex networks, DD2491

CS 3516: Computer Networks

Table of Contents Chapter 1 MPLS Basics Configuration

Taming BGP. An incremental approach to improving the dynamic properties of BGP. Geoff Huston. CAIA Seminar 18 August

Ping, tracert and system debugging commands

Dig into MPLS: Transit Tunnel Diversity

Internet Architecture and Experimentation

Chapter 12 Network Protocols

CSCI Networking Name:

IPv6 Next generation IP

BEng. (Hons) Telecommunications. Examinations for / Semester 2

BGP Routing and BGP Policy. BGP Routing. Agenda. BGP Routing Information Base. L47 - BGP Routing. L47 - BGP Routing

BGP Routing inside an AS

Chapter 09 Network Protocols

MPLS VPN Inter-AS Option AB

On the Impact of Filters on Analyzing Prefix Reachability in the Internet. Ravish Khosla, Sonia Fahmy, Y. Charlie Hu Purdue University ICCCN 2009

Vendor: Alcatel-Lucent. Exam Code: 4A Exam Name: Alcatel-Lucent Interior Routing Protocols and High Availability.

IPv6 Switching: Provider Edge Router over MPLS

CS519: Computer Networks. Lecture 2: Feb 2, 2004 IP (Internet Protocol)

NAT, IPv6, & UDP CS640, Announcements Assignment #3 released

IP Mobility Design Considerations

Using Loops Observed in Traceroute to Infer the Ability to Spoof

MPLS VPN--Inter-AS Option AB

Routing Support for Wide Area Network Mobility. Z. Morley Mao Associate Professor Computer Science and Engineering University of Michigan

The IP Data Plane: Packets and Routers

Chapter 4: Network Layer

Internet Anycast: Performance, Problems and Potential

CS 356: Computer Network Architectures. Lecture 10: IP Fragmentation, ARP, and ICMP. Xiaowei Yang

Subnet Masks. Address Boundaries. Address Assignment. Host. Net. Host. Subnet Mask. Non-contiguous masks. To Administrator. Outside the network

Configuring IP SLAs LSP Health Monitor Operations

Transcription:

The Impact of Router Outages on the AS-Level Internet Matthew Luckie* - University of Waikato Robert Beverly - Naval Postgraduate School *work started while at CAIDA, UC San Diego SIGCOMM 2017, August 24th 2017 www.caida.o 1

Internet Resilience Where are the Single Points of Failure? CE CE CE PE PE PE Example #A Example #B CE: Customer Edge PE: Provider Edge 2

Internet Resilience Where are the Single Points of Failure? CE If the CE router fails, the network is disconnected, so the CE router is a PE Single Point of Failure (SPoF) PE Example #A CE: Customer Edge PE: Provider Edge 3

Internet Resilience Where are the Single Points of Failure? CE CE If the CE router fails, the network has an PE alternate path available, so the CE router is NOT a Single Point of Failure (SPoF) Example #B CE: Customer Edge PE: Provider Edge 4

Internet Resilience Where are the Single Points of Failure? CE CE If the PE router fails, PE the customer network is disconnected, so the PE router is a Single Point of Failure (SPoF) CE: Customer Edge Example #B PE: Provider Edge 5

Challenges in topology analysis Prior approaches analyzed static AS-level and router-level topology graphs, - e.g.: Nature 2000 Important AS-level and router-level topology might be invisible to measurement, such as backup paths, - e.g: INFOCOM 2002 A router that appears to be central to a network s connectivity might not be - e.g.: AMS 2009 6

What we did Large-scale (Internet-wide) longitudinal (2.5 years) measurement study to characterize prevalence of Single Points of Failure (SPoF): 1. Efficiently inferred IPv6 router outage time windows 2. Associated routers with IPv6 BGP prefixes 3. Correlated router outages with BGP control plane 4. Correlated router outages with data plane 5. Validated inferences of SPoF with network operators 7

What we did Identified IPv6 router interfaces from traceroute 83K to 2.4M interfaces from CAIDA s Archipelago traceroute measurements 8

What we did probed router interfaces to infer outage windows We used a single vantage point located at CAIDA, UC San Diego for the duration of this study 9

Central counter: 9290 What we did 10

Central counter: 9290 9291 What we did T1: 9290 9290 10

Central counter: 9290 9291 9292 What we did T1: 9290 T2: 9291 9291 10

Central counter: 9290 9291 9292 9293 What we did T1: 9290 T2: 9291 9292 T3: 9292 10

Central counter: 9290 9291 9292 9293 9294 What we did T1: 9290 T2: 9291 9293 T3: 9292 T4: 9293 10

Central counter: 9290 9291 9292 9293 9294 9295 What we did T1: 9290 T2: 9291 9294 T3: 9292 T4: 9293 T5: 9294 10

Central counter: 9290 9291 9292 9293 9294 9295 1 What we did Reboot! T1: 9290 T2: 9291 T3: 9292 T4: 9293 T5: 9294 10

Central counter: 9290 9291 9292 9293 9294 9295 12 What we did T1: 9290 T2: 9291 1 T3: 9292 T4: 9293 T5: 9294 T6: 1 10

Central counter: 9290 9291 9292 9293 9294 9295 12 3 What we did T1: 9290 T2: 9291 2 T3: 9292 T4: 9293 T5: 9294 T6: 1 T7: 2 10

Central counter: 9290 9291 9292 9293 9294 9295 12 34 What we did T1: 9290 T2: 9291 3 T3: 9292 T4: 9293 T5: 9294 T6: 1 T7: 2 T8: 3 10

What we did probed router interfaces to infer outage windows using IPID T1: 9290 T2: 9291 T3: 9292 T4: 9293 T5: 9294 T6: 1 Outage Window T7: 2 T8: 3 Infer a reboot when time series of values returned from a router is discontinuous, indicating router was restarted 11

Why IPv6 fragment IDs? IPv4 Fragment IDs: - 16 bits, bursty velocity: every packet requires unique ID - At 100Mbps and 1500 byte packets, Nyquist rate dictates 4 second probing interval IPv6 Fragment IDs: - 32 bits, low velocity: IPv6 routers rarely send fragments - We average 15 minute probing interval 12

What we did correlated routers with prefixes using traceroute paths 13

2001:db8:2::/48 What we did correlated routers with prefixes Ark VP using traceroute paths 2001:db8:1::/48 50-60 Ark VPs traceroute every routed IPv6 prefix every day Ark VP 14

2001:db8:2::/48 What we did correlated routers with prefixes Ark VP using traceroute paths 2001:db8:1::/48 50-60 Ark VPs traceroute every routed IPv6 prefix every day Ark VP 14

2001:db8:2::/48 What we did computed distance of Ark VP router from AS announcing network 0 (CE) 2 1 (PE) 2001:db8:1::/48 CE: Customer Edge PE: Provider Edge 15

2001:db8:2::/48 What we did correlated router outage windows with BGP control plane 0 (CE) 2001:db8:1::/48 16

2001:db8:2::/48 What we did correlated router outage windows with BGP control plane T1: 9290 T2: 9291 T3: 9292 T4: 9293 Outage Window T5: 9294 T6: 1 T7: 2 T8: 3 2001:db8:1::/48 17

2001:db8:2::/48 What we did correlated router outage windows with BGP control plane RouteViews Outage Window T1: 9290 T2: 9291 T3: 9292 T4: 9293 T5: 9294 T6: 1 T7: 2 2001:db8:2::/48 T5.2: Peer-1 W T5.2: Peer-2 W T5.3: Peer-3 W T5.3: Peer-4 W T5.8: Peer-3 A T5.8: Peer-2 A T8: 3 2001:db8:1::/48 T5.8: Peer-1 A T5.8: Peer-4 A 18

What we did classified impact on BGP according to observed activity overlapping with inferred outage Complete Withdrawal: all peers simultaneously withdrew route for at least 70 seconds - Single Point of Failure (SPoF) Partial Withdrawal: at least one peer withdrew route for at least 70 seconds, but not all did Churn: BGP activity for the prefix No Impact: No observed BGP activity for the prefix 19

What we did Data Collection Summary Probed IPv6 routers at ~15 minute intervals from 18 Jan 2015 to 30 May 2017 (approx. 2.5 years) 149,560 routers allowed reboots to be detected We inferred 59,175 (40%) rebooted at least once,750k reboots in total 1 0.8 CDF 0.6 0.4 0.2 0 1 10 100 Number of Outages 20

What we found 2,385 (4%) of routers that rebooted (59K) we inferred to be SPoF for at least one IPv6 prefix in BGP Of SPoF routers, we inferred 59% to be customer edge router; 8% provider edge; 29% within destination AS No covering prefix for 70% of withdrawn prefixes - During one-week sample, covering prefix presence during withdrawal did not imply data plane reachability IPv6 Router reboots correlated with IPv4 BGP control plane activity 21

Limitations Applicability to IPv4 depends on router being dual-stack Requires IPID assigned from a counter - Cisco, Huawei, Vyatta, Microtik, HP assign from counter - 27.1% responsive for 14 days assigned from counter Router outage might end before all peers withdraw route - Path exploration + Minimum Route Advertisement Interval (MRAI) + Route Flap Dampening (RFD) Complex events: multiple router outages but one detected - We observed some complex events and filtered them out 22

Validation Reboots SPoF Network?? US University 7 0 8 7 0 8 US R&E backbone #1 2 0 3 3 2 0 US R&E backbone #2 3 0 1 0 0 4 NZ R&E backbone 11 0 22 4 2 27 Total: 23 0 34 14 4 39 = Validated Inference = Incorrect Inference? = Not Validated 23

Validation Reboots SPoF Network?? US University 7 0 8 7 0 8 US R&E backbone #1 2 0 3 3 2 0 US R&E backbone #2 3 0 1 0 0 4 NZ R&E backbone 11 0 22 4 2 27 Total: 23 0 34 14 4 39 Challenging to get validation data: operators often could only tell us about the last reboot 24

Validation Reboots SPoF Network?? US University 7 0 8 7 0 8 US R&E backbone #1 2 0 3 3 2 0 US R&E backbone #2 3 0 1 0 0 4 NZ R&E backbone 11 0 22 4 2 27 Total: 23 0 34 14 4 39 No falsely inferred reboots: we correctly observed the last known reboot of each router 25

Validation Reboots SPoF Network?? US University 7 0 8 7 0 8 US R&E backbone #1 2 0 3 3 2 0 US R&E backbone #2 3 0 1 0 0 4 NZ R&E backbone 11 0 22 4 2 27 Total: 23 0 34 14 4 39 We did not detect some SPoFs 26

Number of Interfaces 3M 1M 100K 30K Data Collection Summary 10K Jan 15 23.5K All Incrementing Jul 15 (a) 83K Jan 16 46.5K Jul 16 ~1.1M (b) 79.8K (c) 41.8K 15.2K Jan 17 PPS List Unresponsive (a) 100 Static 83K 12-24 hours (b) 225 Static 1.1M 12-24 hours (c) 200 Dynamic, ~2.4M 7-14 days 27

Correlating BGP/router outages Control: six hours prior to inferred outages, Feb 2015 Fraction of Reboot/Prefix Pairs 0.5 0.4 0.3 0.2 0.1 Churn Partial Withdrawal Complete Withdrawal Outside Dest. AS Inside Dest. AS 0 12 11 10 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 Distance of Router from Destination AS (IP hops) 28

Correlating BGP/router outages During the inferred outages, Feb 2015 Fraction of Reboot/Prefix Pairs 0.5 0.4 0.3 0.2 0.1 Churn Partial Withdrawal Complete Withdrawal Outside Dest. AS Inside Dest. AS 0 12 11 10 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 Distance of Router from Destination AS (IP hops) 29

BGP Prefix Withdrawals: SPoF 1 CDF 0.8 0.6 0.4 0.2 min max 0 1 min 5 15 30 1 2 4 min min min hr hr hr Complete Withdrawal Duration 8 hr 16 hr 44% less than 5 minutes, suggestive of router maintenance or router crash 30

SPoF prefixes mostly single homed Especially Router hop distance 0 0.2 Fraction of Population 0.4 0.6 0.8 1 SPoFs outside 3 destination AS, 2 as expected PE 1 CE 0 1 2 3 Prefix announced through a single upstream Prefix announced through multiple upstreams 31

Impact on IPv4 prefixes in BGP 1 Cumulative Fraction Router Outages 0.9 0.8 0.7 0.6 0.5 0.4 Before Outage During Outage 0 0.2 0.4 0.6 0.8 1 Control Outage Withdrawn Peers/Advertising Peers We examined IPv4 prefixes for 5% sample of reboots. 19% of correlated IPv4 prefixes withdrawn by at least 90% of peers during router outage window. 32

Summary Step towards root-cause analysis of inter-domain routing outages and events Fraction of Reboot/Prefix Pairs 0.5 Churn Partial Withdrawal Complete Withdrawal 0.4 0.3 0.2 0.1 0 12 11 10 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 Distance of Router from Destination AS (IP hops) - Explore applicability of method to measurement of other critical Internet infrastructure: DNS, Web, Email In our 2.5 year sample of 59K routers that rebooted - 4% (2.3K) were SPoF - SPoF were mostly confined to the edge: 59% customer edge We released our code as part of scamper https://www.caida.org/tools/measurement/scamper/ 33

Backup Slides 34

Impact on IPv4 Services censys.io April 2017 Active Hosts 39,107 HTTP 25,592 HTTPS 16,321 } Web SSH 11,277 DNS 7,922 IMAP 5,127 } Email SMTP 7,383 We examined IPv4 prefixes for 5% sample of reboots where at least 90% of peers during router outage window. 35

Partial Withdrawals CDF 1 0.8 0.6 0.4 0.2 0 50% of pairs had 1 2 peers withdraw 10% of pairs had nearly all peers withdraw 0 0.2 0.4 0.6 0.8 1 Fraction of Peers Withdrawing Route 50% of pairs had 1-2 peers withdraw prefix 10% of pairs had nearly all peers withdraw prefix 36

Degrees of ASes monitored Cumulative Fraction of Rebooting ASes 1 0.8 0.6 0.4 0.2 0 Single Points of Failure Monitored Population 1 10 100 1000 AS Degree ASes that were inferred to have a SPoF were disproportionately low-degree ASes 37

Activity for IPv4 prefixes in BGP Cumulative Fraction Router Outages 1 0.8 0.6 0.4 0.2 0 Before Outage During Outage 0 0.2 0.4 0.6 0.8 1 Peers Sending Updates/Total Peers At least 70% of peers reported BGP activity on IPv4 prefixes for 50% of the inferred router outages 38

Reboot Window Durations CDF 0.9 1 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 1 min 5 min min max 15 min 30 min 1 hr 2 hr Reboot Window Duration 4 hr 8 hr 16 hr Half the maximum reboot lengths were less than 30 minutes (~two probing rounds) 39

Router + BGP outage correlation Router IP-ID Sequence: 10, 11, 12 1, 2, 3 Outage Window BGP Sequence: Withdraw-Contained W A Outage-Contained W A Withdraw-Before W A Announce-After W A 40

Data processing pipeline Uptime Prober rtr targets CAIDA IPv6 Topology <ip,time,ipid> Cassandra Inferred Reboots BGP Correlation AS border distance single points of failure Route Views <peer,time,prefix> 41

Inferring router position Provider Edge (PE) Router Customer Edge (CE) Router AS X AS Y x 1 R 1 x 2 R 2 x 3 R 3 y 1 R 4 y 2 R 5 2 1 0-1 -2 (a) interface addresses routed by Y appear in traceroute x 1 AS X AS Y x x 2 3?? R R 1 2 R R R 3 4 5 2 1 0 (b) no interface addresses routed by Y appear in traceroute 42

Data Collection Summary 18 Jan 15 18 Oct 16 (a) 18 Oct 16 24 Feb 17 (b) 24 Feb 17 30 May 17 (c) Probing rate 100 pps 225 pps 200 pps Interfaces 83K seen Dec 14 1.1M seen Jun to Oct 16 Dynamic. 2.4M in May 17 Responsive every round ~15 mins every round ~15 mins every round ~15 mins Unresponsive 12-24 hours 12-24 hours 7-14 days 43

Why IPv6 fragment IDs? IPv4 ID values are 16 bits with bursty velocity as every packet requires a unique value. Ver HL DSCP length identification offset TTL protocol checksum source address destination address At 100Mbps and 1500 byte packets. Nyquist rate dictates a 4 second probing interval 44

Why IPv6 fragment IDs? IPv6 ID values are 32 bits with low velocity as systems rarely send fragmented packets. Ver DSCP flow id payload length protocol TTL source address destination address protocol reserved offset identification 45

Soliciting IPv6 Fragment IDs echo request, 1300 bytes echo reply, 1300 bytes packet too big, MTU 1280 echo request, 1300 bytes echo reply, 1280 bytes Fragment ID: 12345 46