Internet Measurement Huaiyu Zhu, Rim Kaddah CS538 Fall 2011
OUTLINE
- California Fault Lines: Understanding the Causes and Impact of Network Failures. Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage.
- A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush.
Why Study Failure
- Failure is a reality for large networks
- Achieving high availability requires engineering the network to be robust to failure
- Designing mechanisms that effectively mitigate failures requires a deep understanding of real failures
CENIC Network
- Serves California educational institutions
- Over 200 routers
- 5 years of data
- Three types of components:
  o The Digital California (DC) network
  o The High-Performance Research (HPR) network
  o Customer-premises equipment (CPE)
Contribution
- Methodology to reconstruct historical failure events of the CENIC network
  o Uses only commonly available data
  o No need for additional instrumentation
- Analysis of the network based on the failure measurements
Reconstruction
What data are available to reconstruct a failure 4 years later?
- Syslog: describes interface state changes
- Router configuration files: map interfaces to links
- Operational announcements on the mailing list
These data were not intended for failure reconstruction!
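The reconstruction idea above can be illustrated with a minimal sketch. It assumes the syslog messages have already been parsed into (timestamp, router, interface, state) tuples and that the configuration files have been reduced to a mapping from (router, interface) to a link name. The function name, tuple layout, and pairing rule are hypothetical simplifications, not the pipeline used in the paper.

```python
def reconstruct_failures(events, iface_to_link):
    """Pair DOWN/UP interface state changes into (link, start, end) failure intervals.

    events: iterable of (timestamp, router, interface, state), state in {"down", "up"}
    iface_to_link: dict mapping (router, interface) -> link name (derived from configs)
    """
    down_since = {}   # link -> timestamp of the earliest unmatched DOWN message
    failures = []
    for ts, router, iface, state in sorted(events):
        link = iface_to_link.get((router, iface))
        if link is None:
            continue  # interface is not mapped to any known link
        if state == "down":
            # Both ends of a link may report the failure; keep the first report.
            down_since.setdefault(link, ts)
        elif state == "up" and link in down_since:
            # Assumption: the first UP message from either end closes the failure.
            failures.append((link, down_since.pop(link), ts))
    return failures
```

For example, two DOWN messages from the two ends of the same link followed by two UP messages yield a single failure interval for that link.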
Validation
- Internal consistency: use the administrator announcements to validate the reconstructed event history
- External consistency:
  o CAIDA Skitter project (now Ark) to validate UP states
  o Route Views project to validate DOWN states
Overview of Link Failures
Overview of Link Failures
- Vertical banding
  o V1: a network-wide IS-IS configuration change requiring a router restart
  o V2: a network-wide software upgrade
  o V3: a network-wide configuration change in preparation for IPv6
- Horizontal banding
  o H1: a series of failures on a link between a core router and a County of Education office (hardware)
  o H2: a link that experienced over 33,000 short-duration failures (fiber cut)
CDFs of Individual Failure Events
Various Link Hardware Types
Cause of Failure
Failure Events
Summary
- Engineering for failure requires real data
  o Data have historically been difficult to obtain
- Methodology to perform historical failure analysis with low-quality data sources
- Shared our findings in the CENIC network
  o Reliability of individual components
  o Causes of failures
  o Impact of failure
OUTLINE
- California Fault Lines: Understanding the Causes and Impact of Network Failures. Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage.
- A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush.
Key Questions
- How can routing events degrade end-to-end path performance?
- How do topological properties and routing policies affect performance degradation?
Approach
- Study end-to-end performance under realistic topologies.
- Investigate several metrics to characterize end-to-end loss, delay, and out-of-order packets.
- Characterize the kinds of routing changes that impact end-to-end path performance.
- Analyze the impact of topology, routing policies, the MRAI timer, and iBGP configurations on end-to-end path performance.
Experiment Methodology
- A multi-homed prefix
  o BGP Beacon prefix: 192.83.230.0/24
- Controlled routing changes
  o Failover events: the Beacon changes from being connected to both providers to being connected to a single provider.
  o Recovery events: the Beacon changes from being connected to a single provider to being connected to both providers.
[Figure: Beacon connectivity to ISP 1 and ISP 2 across a failover event and a recovery event]
Controlled Routing Changes
- 12 routing events every day
  o 8 Beacon events (failover events and recovery events)
  o 4 events for resetting the Beacon connectivity
Time schedule (GMT) for BGP Beacon routing transitions
Active Probing
- Goal: capture the impact of routing changes on end-to-end performance.
- Probes are sent from 37 PlanetLab hosts to the Beacon host (a host within the Beacon prefix).
- Three probing methods:
  o Back-to-back traceroutes
  o Back-to-back pings
  o UDP probing (50 ms interval)
- Data-plane performance metrics: packet loss, delay, out-of-order packets
[Figure: hosts A, B, and C probing the Beacon host across the Internet via ISP 1 and ISP 2]
Packet Loss
- Loss burst: consecutive UDP probing packets lost during a routing change event.
[Figure panels: Failover, Recovery]
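The loss-burst definition can be made concrete with a short sketch. It assumes each UDP probe carries a sequence number starting at 0 and that the receiver logs the sequence numbers that arrived. The function name and input format are illustrative assumptions, not taken from the paper.

```python
def loss_bursts(received_seqs, total_sent):
    """Return the lengths of runs of consecutive lost probes.

    received_seqs: sequence numbers of probes that arrived (any order)
    total_sent: number of probes sent, numbered 0 .. total_sent - 1
    """
    received = set(received_seqs)
    bursts = []
    run = 0  # length of the current run of consecutive losses
    for seq in range(total_sent):
        if seq in received:
            if run:
                bursts.append(run)
            run = 0
        else:
            run += 1
    if run:
        bursts.append(run)  # losses that continue to the end of the trace
    return bursts
```

With probes sent every 50 ms, multiplying a burst length by the probing interval gives a rough estimate of how long the path was unusable.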
Packet Delay
- Round-trip delays from the probe host to the Beacon host (one-way delays suffer from clock skew).
[Figure panels: Failover, Recovery]
Out-of-order Packets
- Number of reorderings (number of packets out of order)
- Reordering offset
[Figure panels: Failover, Recovery]
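The slide does not spell out how the two reordering metrics are computed, so the following is only a sketch under stated assumptions: a packet counts as out of order if it arrives after a packet with a higher sequence number, and its offset is taken to be the gap between the highest sequence number seen so far and its own. Both definitions and the function name are assumptions made for illustration.

```python
def reordering_stats(arrival_seqs):
    """Return (number of out-of-order packets, list of their offsets).

    arrival_seqs: probe sequence numbers in the order they arrived.
    """
    max_seen = -1
    offsets = []
    for seq in arrival_seqs:
        if seq < max_seen:
            # Arrived after a later-numbered packet: out of order.
            offsets.append(max_seen - seq)  # assumed definition of the offset
        else:
            max_seen = seq
    return len(offsets), offsets
```

For the arrival order 0, 1, 3, 2, 4, 6, 7, 5, packets 2 and 5 are counted as reordered.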
How Do Routing Failures Occur? (Failover)
- Prefer-customer routing policy: routes received from a provider's customers are always preferred over those received from its peers.
[Figure: routers R1-R6 in Provider 1 and Provider 2, connected by a peer link, with customer links to Beacon AS 0]
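A minimal sketch of route selection under the prefer-customer policy follows. The slide only states that customer routes beat peer routes. Ranking provider routes last and breaking ties on AS-path length are common conventions added here as assumptions, and the names are hypothetical.

```python
# Lower rank is preferred. The "provider" rank is an assumption; the slide
# only states that customer routes are preferred over peer routes.
PREFERENCE = {"customer": 0, "peer": 1, "provider": 2}

def best_route(routes):
    """Pick the preferred route from a list of (relationship, as_path) pairs.

    relationship: how the route was learned ("customer", "peer", or "provider")
    as_path: tuple of AS numbers; shorter paths break ties (assumed tie-breaker)
    """
    return min(routes, key=lambda r: (PREFERENCE[r[0]], len(r[1])))
```

Under this rule a longer customer route is still chosen over a shorter peer route. One consequence is that a router holding only a customer route has no ready alternative at the moment that route is withdrawn.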
How Do Routing Failures Occur? (Failover) (contd.)
- No-valley routing policy: peers do not transit traffic from one peer to another.
[Figure: routers R1-R9 in Provider 1, Provider 2, and Provider 3, connected by peer links, above Beacon AS 0]
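The no-valley rule can be sketched as an export filter. The slide states only the peer-to-peer case. The sketch below uses the usual generalization (routes learned from a customer may go to anyone, while routes learned from a peer or provider may go only to customers), which is an assumption beyond the slide. The function name is hypothetical.

```python
def may_export(learned_from, send_to):
    """Return True if a route learned from one neighbor type may be exported to another.

    learned_from, send_to: "customer", "peer", or "provider"
    """
    # Customer routes are exported to everyone; peer and provider routes
    # are exported only to customers (assumed generalized no-valley rule).
    return learned_from == "customer" or send_to == "customer"
```

In particular, `may_export("peer", "peer")` is false, so a provider that loses its direct route cannot fall back on a path that crosses two peer links in a row.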
How Do Routing Failures Occur? (Recovery)
- iBGP constraint: a route received from an iBGP router cannot be re-advertised to another iBGP router.
1. Path (0) recovers at R3.
2. R3 sends the path to R2.
3. R2 sends a withdrawal to R1.
4. R3 sends the recovery path to R1.
5. R1 regains its connection to the Beacon.
[Figure: routers R1-R4 across Provider 1 and Provider 2, above Beacon AS 0, with messages labeled Path (0) and Withdraw (2 0)]
Summary
- During failover and recovery events:
  o Routing events significantly impact packet loss.
  o Routing failures contribute significantly to end-to-end packet loss.
  o Routing events can lead to long packet round-trip delays and reordering.
- Routing policies and iBGP configuration play a major role in causing packet loss during routing events.
Discussion
- How could we prevent packet loss during path exploration?
- Would storing an alternative path in each router be a good idea? What are the downsides?
- How could we exploit the previous results to improve end-to-end performance?
- How realistic is the topology in the second paper?
References
- Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush. A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. SIGCOMM 2006.
- Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush. Presentation at SIGCOMM 2006.
- Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. California Fault Lines: Understanding the Causes and Impact of Network Failures. SIGCOMM 2010.
- Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. Presentation at SIGCOMM 2010.