NetPilot: Automating Datacenter Network Failure Mitigation

1 NetPilot: Automating Datacenter Network Failure Mitigation
Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

2 Failures are Common and Harmful
Network failures are common: 10,000+ switches.

4 Failures are Common and Harmful
Network failures are common.
Failures cause long down times: in six-month failure logs of production datacenters, 25% of failures take 13+ hours to repair.
Figure: time from detection to repair (minutes).

5 Failures are Common and Harmful
Failures are common because datacenters are VERY large.
Failures cause long down times.
Long failure duration means large revenue loss.

7 How to Shorten Failure Recovery Time?

8 Previous Work
Conventional failure recovery takes 3 steps: Detection, Diagnosis, Repair.
Detection methods shown: passive, ping, active.

9 Previous Work
Conventional failure recovery takes 3 steps: Detection, Diagnosis, Repair.
Failure localization/diagnosis: [M. K. Aguilera, SOSP 03], [M. Y. Chen, NSDI 04], [R. R. Kompella, NSDI 05], [P. Bahl, SIGCOMM 07], [S. Kandula, SIGCOMM 09]

10 Automating Failure Diagnosis is Challenging
Root causes are deep in the network stack.
Diagnosis involves multiple parties.

11-13 Failure Categories, Diagnosis and Repair
Source: six-month failure logs from several production DCNs.

Category (share)     Failure type                  Diagnosis & repair     Share
Software (21%)       Link layer loop               Find and fix bugs      19%
                     Imbalance overload            Find and fix bugs      2%
Hardware (18%)       FCS error                     Replace cable          13%
                     Unstable power                Repair power           5%
Unknown (23%)        Switch stops forwarding       N/A                    9%
                     Imbalance overload            N/A                    7%
                     Lost configuration            N/A                    5%
                     High CPU utilization          N/A                    2%
Configuration (38%)  Errors on multiple switches   Update configuration   32%
                     Errors on one switch          Update configuration   6%

1. Root causes are deep in the network stack.
2. Diagnosis involves multiple parties.
Failure diagnosis requires human intervention!

14 Can we do something other than failure diagnosis?

15 NetPilot: Mitigating rather than Diagnosing Failures
Mitigate failure symptoms ASAP, at the cost of reduced capacity.
Diagram labels: Detection, Diagnosis, Repair.

17 NetPilot Benefits
Short recovery time.
Small network disruption.
Low operation cost.
Diagram labels: Detection, Automated Mitigation, Diagnosis, Repair.

18 Failure Mitigation is Effective
Most failures can be mitigated by simple actions.
Mitigation is feasible due to redundancy.

19-21 Failure Categories, Mitigation and Repair

Category (share)     Failure type                  Mitigation         Repair                Share
Software (21%)       Link layer loop               Deactivate port    Find and fix bugs     19%
                     Imbalance-triggered overload  Restart switch     Find and fix bugs     2%
Hardware (18%)       FCS error                     Deactivate port    Replace cable         13%
                     Unstable power                Deactivate switch  Repair power          5%
Unknown (23%)        Switch stops forwarding       Restart switch     N/A                   9%
                     Imbalance-triggered overload  Restart switch     N/A                   7%
                     Lost configuration            Restart switch     N/A                   5%
                     High CPU utilization          Restart switch     N/A                   2%
Configuration (38%)  Errors on multiple switches   n/a                Update configuration  32%
                     Errors on single switch       Deactivate switch  Update configuration  6%

68% of failures can be mitigated by simple actions.

24 Outline
Automating failure diagnosis is challenging
Failure mitigation is effective
How to automate mitigation?
NetPilot evaluations
Conclusion

25 A Strawman NetPilot: Trial-and-error
Flowchart: Network failure -> Localization -> Execute an action -> Failure mitigated?
Yes: End.
No: roll back if necessary and execute another action.

26-27 NetPilot: Challenges & Solutions
Flowchart: Network failure -> Localization -> Execute an action -> Failure mitigated? (Yes: End. No: roll back if necessary.)
Challenge 1: Blind trial-and-error takes a long time.
Solution: failure-specific localization.

28 NetPilot: Challenges & Solutions
Flowchart: Network failure -> Localization -> Estimate impact -> Execute an action -> Failure mitigated? (Yes: End. No: roll back if necessary.)
Challenge 2: An action may partition or overload the network.
Solution: impact estimation.

31 NetPilot: Challenges & Solutions
Flowchart: Network failure -> Localization -> Estimate impact -> Rank actions -> Execute an action -> Failure mitigated? (Yes: End. No: roll back if necessary.)
Challenge 3: Different actions have different side-effects.
Solution: rank actions based on impact.

32 Failure Specific Localization
Limited # of failure types.
Domain knowledge improves accuracy.
Failure types:
1. Link layer loop
2. Imbalance-triggered overload
3. FCS error
4. Unstable power
5. Switch stops forwarding
6. Imbalance-triggered overload
7. Lost configuration
8. High CPU utilization
9. Errors on multiple switches
10. Errors on single switch

33 Example: Frame Check Sequence (FCS) Errors
13% of all the failures.
Cut-through switching: frames are forwarded before checksums are verified.
FCS errors increase application latency.

34 Localizing FCS Errors
For each link L: error frames seen on L = frames corrupted by L + frames corrupted by other links that traverse L.
x_L: corruption rate of link L.
# of variables = # of equations = # of links.
Corrupted links: x_L > 0.
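
The per-link balance on this slide can be read as a square linear system with one unknown corruption rate per link. The sketch below is one possible formulation, not NetPilot's actual implementation: it assumes each link's frame count and the number of frames that crossed link k before reaching link l are known, and it ignores frames corrupted more than once. The names `frames`, `upstream_frames`, `error_frames`, and both function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build an augmented matrix so the caller's inputs are left untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: corruption rates are not identifiable")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def localize_fcs_errors(links, frames, upstream_frames, error_frames, threshold=1e-6):
    """Return {link: estimated corruption rate} for links whose rate exceeds threshold.

    Assumed model (one equation per link l):
        error_frames[l] = frames[l] * x[l] + sum over k != l of upstream_frames[(k, l)] * x[k]
    where upstream_frames[(k, l)] counts frames that crossed k before reaching l.
    """
    a = [[frames[l] if k == l else upstream_frames.get((k, l), 0) for k in links]
         for l in links]
    b = [error_frames[l] for l in links]
    rates = solve_linear_system(a, b)
    return {link: rate for link, rate in zip(links, rates) if rate > threshold}


# Toy chain A -> B -> C: only B corrupts frames, yet both B and C observe errors.
links = ["A", "B", "C"]
frames = {"A": 1000, "B": 1000, "C": 1000}
upstream_frames = {("A", "B"): 1000, ("A", "C"): 1000, ("B", "C"): 1000}
error_frames = {"A": 0, "B": 10, "C": 10}
corrupted = localize_fcs_errors(links, frames, upstream_frames, error_frames)
print(corrupted)
```

In this toy input, link C sees as many error frames as link B, but solving the system attributes them to B alone, which is the point of separating locally corrupted frames from those corrupted upstream.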

35 NetPilot Overview
Flowchart: Network failure -> Localization -> Estimate impact -> Rank actions -> Execute an action -> Failure mitigated?
Yes: End.
No: roll back if necessary and try again.
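
The flowchart can be sketched as a short control loop. This is an illustration only: every hook passed to `mitigate` (`localize`, `propose_actions`, `estimate_impact`, `execute`, `is_mitigated`, `roll_back`) is a hypothetical callable, and the sketch always rolls back a failed action, whereas the slide says "roll back if necessary".

```python
def mitigate(failure, localize, propose_actions, estimate_impact,
             execute, is_mitigated, roll_back):
    """Try candidate actions in order of increasing estimated impact.

    Returns the action that mitigated the failure, or None if none did.
    """
    suspects = localize(failure)                    # Localization
    actions = propose_actions(failure, suspects)    # candidate actions on the suspects
    ranked = sorted(actions, key=estimate_impact)   # Estimate impact, then rank actions
    for action in ranked:
        execute(action)                             # Execute an action
        if is_mitigated(failure):                   # Failure mitigated?
            return action                           # Yes: end
        roll_back(action)                           # No: undo, then try the next action
    return None


# Toy run with stub hooks: the lower-impact action is tried first but does not help.
log = []
impact = {"deactivate port p3": 0.1, "restart switch s1": 0.2}
state = {"fixed": False}


def execute(action):
    log.append(("execute", action))
    state["fixed"] = action == "restart switch s1"


def roll_back(action):
    log.append(("roll back", action))


chosen = mitigate(
    failure="overload on s1",
    localize=lambda failure: ["s1"],
    propose_actions=lambda failure, suspects: list(impact),
    estimate_impact=impact.get,
    execute=execute,
    is_mitigated=lambda failure: state["fixed"],
    roll_back=roll_back,
)
print(chosen)
```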

36 Impact Metrics
Derived from the Service Level Agreement (SLA).
Availability: online_server_ratio
Packet loss: total_lost_pkt
Latency: max_link_utilization (small link utilization implies small queuing delay)
total_lost_pkt and max_link_utilization are derived from the utilization of individual links.

37 Estimating Link Utilization
Impact estimator inputs: action, traffic, topology. Output: link utilization.
# of flows >> # of redundant paths, so traffic is evenly distributed under ECMP.
Estimate the load contributed by each flow on each link.
Sum up the loads to compute utilization.
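
A minimal sketch of the estimation steps on this slide, under stated assumptions: links are bidirectional with unit cost, every hop splits a flow evenly over its shortest-path next hops, and all links share one capacity. The function names and the toy topology are hypothetical and are not taken from NetPilot.

```python
from collections import defaultdict, deque


def hop_distances(adjacency, dst):
    """Breadth-first hop counts from every reachable node to dst."""
    dist = {dst: 0}
    queue = deque([dst])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def estimate_link_load(links, flows):
    """Return {(u, v): load} keyed by direction of travel.

    links: iterable of (u, v) pairs; flows: iterable of (src, dst, rate).
    Flows whose destination is unreachable are skipped.
    """
    adjacency = defaultdict(set)
    for u, v in links:
        adjacency[u].add(v)
        adjacency[v].add(u)

    load = defaultdict(float)
    for src, dst, rate in flows:
        dist = hop_distances(adjacency, dst)
        if src not in dist:
            continue
        arriving = defaultdict(float)
        arriving[src] = rate
        # Visit nodes farthest-first so each node's inflow is complete before it is split.
        for node in sorted(dist, key=dist.get, reverse=True):
            share = arriving[node]
            if share == 0 or node == dst:
                continue
            next_hops = [n for n in adjacency[node] if dist[n] == dist[node] - 1]
            for hop in next_hops:
                load[(node, hop)] += share / len(next_hops)
                arriving[hop] += share / len(next_hops)
    return dict(load)


def max_link_utilization(load, capacity):
    """Utilization of the busiest link, assuming one capacity for all links."""
    return max(load.values()) / capacity if load else 0.0


# Toy topology: one ToR reaches the core through two aggregation switches.
topology = [("tor", "agg_a"), ("tor", "agg_b"), ("agg_a", "core"), ("agg_b", "core")]
traffic = [("tor", "core", 10.0)]
before = estimate_link_load(topology, traffic)
# Candidate action: deactivate the tor-agg_a link, then re-estimate.
after = estimate_link_load(topology[1:], traffic)
print(max_link_utilization(before, 10.0), max_link_utilization(after, 10.0))
```

Re-running the estimate on the topology that would result from an action is how a sketch like this turns "action, traffic, topology" into a utilization figure that can be compared across actions.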

38 Link Utilization Estimation is Highly Accurate
1-month traffic from an 8000-server network.
Log socket events on each server.
Ground truth: SNMP counters.

39 NetPilot Overview
Flowchart: Network failure -> Localization -> Estimate impact -> Rank actions -> Execute an action -> Failure mitigated? (Yes: End. No: roll back if necessary.)
Rank actions: choose the action with the least impact.
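
As an illustration of choosing the action with the least impact, the sketch below compares candidate actions on the three impact metrics named in the deck. The priority order used here (availability first, then packet loss, then link utilization) is an assumption made for the example, as are the action names and all numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    online_server_ratio: float   # availability: fraction of servers still online
    total_lost_pkt: float        # packet loss: estimated packets lost
    max_link_utilization: float  # latency proxy: utilization of the busiest link


def rank_key(impact):
    """Smaller key means smaller impact. The metric priority is an assumption."""
    return (-impact.online_server_ratio, impact.total_lost_pkt, impact.max_link_utilization)


def least_impact_action(estimates):
    """estimates: {action name: Impact}. Returns the name with the smallest rank_key."""
    return min(estimates, key=lambda action: rank_key(estimates[action]))


# Hypothetical estimates for three candidate actions.
estimates = {
    "deactivate link core1-agg": Impact(1.0, 0.0, 0.9),
    "deactivate link core2-agg": Impact(1.0, 0.0, 0.6),
    "deactivate switch agg": Impact(0.5, 0.0, 0.3),
}
best = least_impact_action(estimates)
print(best)
```

A lexicographic key keeps the comparison simple; a deployment with different SLA priorities would reorder or weight the fields.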

40 Outline
Automating failure diagnosis is challenging
Failure mitigation is effective
How to automate mitigation? Localization -> impact estimation -> ranking
NetPilot evaluations: mitigating load imbalance, mitigating FCS errors, mitigating overload
Conclusion

41 Load Imbalance
Agg a stops receiving traffic.
Localize to 4 suspects.
Figure labels: core a, core b, Agg a, Agg b.

48-49 Fast FCS Error Mitigation
Human operator: after 11 trials in 3.5 hours, 2 out of 28 ports are deactivated.
NetPilot: deactivates 2 links in 1 trial within 15 minutes.

52 Mitigating Link Overload
Mitigate overload by deactivating healthy links.
Many candidate links in production networks.
Choose the link(s) with the least impact.
Figure: three panels, each showing core 1, core 2, and agg; one label reads "3 lost".

53 Action Ranking Lowers Link Utilization
Replay 97 overload incidents due to link failures.

54 Conclusion
Mitigation reduces failure recovery time: simple actions are effective, and redundancy makes them possible.
NetPilot: automating failure mitigation. Recovery time drops from hours to minutes.
Several mitigation scenarios deployed in Bing.

55 Thank You!
NetPilot diagram labels: Detection, Automated Mitigation, Diagnosis, Repair.

57 NetPilot Shortens Recovery Time
Time from detection to mitigation, over 6 months and many production datacenters.
NetPilot mitigates 3 types of failures, all within 30 minutes.
Operators work around 50% of failures in 2 HOURS.

Automated Bug Removal for Software-Defined Networks

Automated Bug Removal for Software-Defined Networks Automated Bug Removal for Software-Defined Networks Yang Wu* Ang Chen* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University of Pennsylvania + Georgetown University 1 Motivation: Automated repair

More information

DeTail Reducing the Tail of Flow Completion Times in Datacenter Networks. David Zats, Tathagata Das, Prashanth Mohan, Dhruba Borthakur, Randy Katz

DeTail Reducing the Tail of Flow Completion Times in Datacenter Networks. David Zats, Tathagata Das, Prashanth Mohan, Dhruba Borthakur, Randy Katz DeTail Reducing the Tail of Flow Completion Times in Datacenter Networks David Zats, Tathagata Das, Prashanth Mohan, Dhruba Borthakur, Randy Katz 1 A Typical Facebook Page Modern pages have many components

More information

Why do Internet services fail, and what can be done about it?

Why do Internet services fail, and what can be done about it? Why do Internet services fail, and what can be done about it? David Oppenheimer, Archana Ganapathi, and David Patterson Computer Science Division University of California at Berkeley 4 th USENIX Symposium

More information

Democratically Finding The Cause of Packet Drops

Democratically Finding The Cause of Packet Drops Democratically Finding The Cause of Packet Drops Behnaz Arzani, Selim Ciraci, Luiz Chamon, Yibo Zhu, Hongqiang (Harry) Liu, Jitu Padhye, Geoff Outhred, Boon Thau Loo 1 Marple- SigComm 2017 Sherlock- SigComm

More information

Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure

Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure Qiao Zhang 1, Guo Yu 2, Chuanxiong Guo 3, Yingnong Dang 4, Nick Swanson 4, Xinsheng Yang 4, Randolph Yao 4, Murali Chintalapati

More information

Internet Measurement Huaiyu Zhu, Rim Kaddah CS538 Fall 2011

Internet Measurement Huaiyu Zhu, Rim Kaddah CS538 Fall 2011 Internet Measurement Huaiyu Zhu, Rim Kaddah CS538 Fall 2011 OUTLINE California Fault Lines: Understanding the Causes and Impact of Network Failures. Feng Wang, Zhuoqing Morley MaoJia Wang3, Lixin Gao and

More information

Sherlock Diagnosing Problems in the Enterprise

Sherlock Diagnosing Problems in the Enterprise Sherlock Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang Enterprise Management: Between a Rock and a Hard Place Manageability

More information

Solving Practical Problems in Datacenter Networks

Solving Practical Problems in Datacenter Networks Solving Practical Problems in Datacenter Networks by Xin Wu Department of Computer Science Duke University Date: Approved: Xiaowei Yang, Supervisor Bruce Maggs Jeffrey Chase Romit Roy Choudhury Dissertation

More information

Automatic Life Cycle Management of Network Configurations

Automatic Life Cycle Management of Network Configurations Hongqiang Harry Liu, Xin Wu, Wei Zhou, Weiguo Chen, Tao Wang, Hui Xu, Lei Zhou, Qing Ma, Ming Zhang Alibaba Group ABSTRACT Managing the life cycle of network configurations, including the generation, update,

More information

The Day the DNS Died

The Day the DNS Died The Day the DNS Died Jeremy Blosser, Principal Operations Engineer jblosser@sparkpost.com https://tinyurl.com/spdnstalk 1 Introduction SparkPost, aka Message Systems, is a high-volume, transactional email

More information

Data Center TCP (DCTCP)

Data Center TCP (DCTCP) Data Center Packet Transport Data Center TCP (DCTCP) Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, Murari Sridharan Cloud computing

More information

Lecture 15: Datacenter TCP"

Lecture 15: Datacenter TCP Lecture 15: Datacenter TCP" CSE 222A: Computer Communication Networks Alex C. Snoeren Thanks: Mohammad Alizadeh Lecture 15 Overview" Datacenter workload discussion DC-TCP Overview 2 Datacenter Review"

More information

Debugging the Data Plane with Anteater

Debugging the Data Plane with Anteater Debugging the Data Plane with Anteater Haohui Mai, Ahmed Khurshid Rachit Agarwal, Matthew Caesar P. Brighten Godfrey, Samuel T. King University of Illinois at Urbana-Champaign Network debugging is challenging

More information

Configuring Switch Latency Monitoring

Configuring Switch Latency Monitoring This chapter contains the following sections: Information About Switch Latency Monitoring, page 1 How to Configure Switch Latency Monitoring, page 3 Configuration Examples for Switch Latency Monitoring,

More information

10X: Power System Technology 10 Years Ahead of Industry International Standards- Based Communications

10X: Power System Technology 10 Years Ahead of Industry International Standards- Based Communications 10X: Power System Technology 10 Years Ahead of Industry International Standards- Based Communications David Dolezilek International Technical Director Information Technology (IT) Methods Jeopardize Operational

More information

CQNCR: Optimal VM Migration Planning in Cloud Data Centers

CQNCR: Optimal VM Migration Planning in Cloud Data Centers CQNCR: Optimal VM Migration Planning in Cloud Data Centers Presented By Md. Faizul Bari PhD Candidate David R. Cheriton School of Computer science University of Waterloo Joint work with Mohamed Faten Zhani,

More information

Juggling the Jigsaw Towards Automated Problem Inference from Network Trouble Tickets

Juggling the Jigsaw Towards Automated Problem Inference from Network Trouble Tickets Juggling the Jigsaw Towards Automated Problem Inference from Network Trouble Tickets Rahul Potharaju (Purdue University) Navendu Jain (Microsoft Research) Cristina Nita-Rotaru (Purdue University) April

More information

Improving Network Agility with Seamless BGP Reconfigurations

Improving Network Agility with Seamless BGP Reconfigurations Improving Network Agility with Seamless BGP Reconfigurations Corneliu Claudiu Prodescu School of Engineering and Sciences Jacobs University Bremen Campus Ring 1, 28759 Bremen, Germany Monday 18 th March,

More information

Configure IP SLA Tracking for IPv4 Static Routes on an SG550XG Switch

Configure IP SLA Tracking for IPv4 Static Routes on an SG550XG Switch Configure IP SLA Tracking for IPv4 Static Routes on an SG550XG Switch Introduction When using static routing, you may experience a situation where a static route is active, but the destination network

More information

A Network-State Management Service. Peng Sun Ratul Mahajan, Jennifer Rexford, Lihua Yuan, Ming Zhang, Ahsan Arefin Princeton & Microsoft

A Network-State Management Service. Peng Sun Ratul Mahajan, Jennifer Rexford, Lihua Yuan, Ming Zhang, Ahsan Arefin Princeton & Microsoft A Network-State Management Service Peng Sun Ratul Mahajan, Jennifer Rexford, Lihua Yuan, Ming Zhang, Ahsan Arefin Princeton & Microsoft Complex Infrastructure 1 Complex Infrastructure Microsoft Azure Number

More information

Data Provenance at Internet Scale: Architecture, Experiences, and the Road Ahead. Ang Chen, Yang Wu, Andreas Haeberlen, Boon Thau Loo, Wenchao Zhou

Data Provenance at Internet Scale: Architecture, Experiences, and the Road Ahead. Ang Chen, Yang Wu, Andreas Haeberlen, Boon Thau Loo, Wenchao Zhou Data Provenance at Internet Scale: Architecture, Experiences, and the Road Ahead Ang Chen, Yang Wu, Andreas Haeberlen, Boon Thau Loo, Wenchao Zhou Motivation D E Alice A B C foo.com An example scenario:

More information

"Charting the Course... TSHOOT Troubleshooting and Maintaining Cisco IP Networks Course Summary

Charting the Course... TSHOOT Troubleshooting and Maintaining Cisco IP Networks Course Summary Course Summary Description This course is designed to help network professionals improve the skills and knowledge that they need to maintain their network and to diagnose and resolve network problems quickly

More information

Data Center TCP (DCTCP)

Data Center TCP (DCTCP) Data Center TCP (DCTCP) Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, Murari Sridharan Microsoft Research Stanford University 1

More information

Media Path Analysis. Analyzing Media Paths Using IP SLA. Before You Begin. This section contains the following:

Media Path Analysis. Analyzing Media Paths Using IP SLA. Before You Begin. This section contains the following: This section contains the following: Analyzing Media Paths Using IP SLA, page 1 Analyzing Media Paths Using VSAA, page 3 Managing a Video Test Call, page 6 Analyzing Media Paths Using IP SLA To start a

More information

MULTINATIONAL BANKING CORPORATION INVESTS IN ROUTE ANALYTICS TO AVOID OUTAGES

MULTINATIONAL BANKING CORPORATION INVESTS IN ROUTE ANALYTICS TO AVOID OUTAGES MULTINATIONAL BANKING CORPORATION INVESTS IN ROUTE ANALYTICS TO AVOID OUTAGES CASE STUDY Table of Contents Organization Background and Network Summary 3 Outage Precursor and Impact 3 Outage Analysis 4

More information

Consistent SDN Flow Migration aided by Optical Circuit Switching. Rafael Lourenço December 2nd, 2016

Consistent SDN Flow Migration aided by Optical Circuit Switching. Rafael Lourenço December 2nd, 2016 Consistent SDN Flow Migration aided by Optical Circuit Switching Rafael Lourenço December 2nd, 2016 What is Flow Migration? Installation/update of new rules on [multiple] [asynchronous] SDN switches to

More information

The fundamentals of Ethernet!

The fundamentals of Ethernet! Building Ethernet Connectivity Services for Provider Networks" " Eduard Bonada i Cruells" Tesi Doctoral UPF / 2012 Dirigida per Dra. Dolors Sala i Batlle Departament de Tecnologies de la Informació i les

More information

Cisco I/O Accelerator Deployment Guide

Cisco I/O Accelerator Deployment Guide Cisco I/O Accelerator Deployment Guide Introduction This document provides design and configuration guidance for deploying the Cisco MDS 9000 Family I/O Accelerator (IOA) feature, which significantly improves

More information

Traffic Engineering with Forward Fault Correction

Traffic Engineering with Forward Fault Correction Traffic Engineering with Forward Fault Correction Harry Liu Microsoft Research 06/02/2016 Joint work with Ratul Mahajan, Srikanth Kandula, Ming Zhang and David Gelernter 1 Cloud services require large

More information

Deadline Guaranteed Service for Multi- Tenant Cloud Storage Guoxin Liu and Haiying Shen

Deadline Guaranteed Service for Multi- Tenant Cloud Storage Guoxin Liu and Haiying Shen Deadline Guaranteed Service for Multi- Tenant Cloud Storage Guoxin Liu and Haiying Shen Presenter: Haiying Shen Associate professor *Department of Electrical and Computer Engineering, Clemson University,

More information

IP SLAs Overview. Finding Feature Information. Information About IP SLAs. IP SLAs Technology Overview

IP SLAs Overview. Finding Feature Information. Information About IP SLAs. IP SLAs Technology Overview This module describes IP Service Level Agreements (SLAs). IP SLAs allows Cisco customers to analyze IP service levels for IP applications and services, to increase productivity, to lower operational costs,

More information

Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network

Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University of Michigan) Jennifer Rexford (Princeton University)

More information

LON2 is Live LINX May

LON2 is Live LINX May LON2 21/05/18 LON2 is Live LINX 101 21 May 2018 2 The New Network LINX 101 3 The Live Network LINX 101 21 May 2018 4 Remaining ExtremeLAN LINX 101 21 May 2018 5 New Technologies > IP Fabric leaf/spine

More information

Cisco SAN Analytics and SAN Telemetry Streaming

Cisco SAN Analytics and SAN Telemetry Streaming Cisco SAN Analytics and SAN Telemetry Streaming A deeper look at enterprise storage infrastructure The enterprise storage industry is going through a historic transformation. On one end, deep adoption

More information

Port Tapping Session 2 Race tune your infrastructure

Port Tapping Session 2 Race tune your infrastructure Port Tapping Session 2 Race tune your infrastructure Born on Oct 30 th 2012. 2 3 Tap Module Red adapter indicates TAP port 4 Corning Fibre Channel and Ethernet Tap s 72 Ports per 1U 288 Ports per 4U 5

More information

Lecture 16: Data Center Network Architectures

Lecture 16: Data Center Network Architectures MIT 6.829: Computer Networks Fall 2017 Lecture 16: Data Center Network Architectures Scribe: Alex Lombardi, Danielle Olson, Nicholas Selby 1 Background on Data Centers Computing, storage, and networking

More information

IPv6 Management 101 Share Session Anaheim

IPv6 Management 101 Share Session Anaheim IPv6 Management 101 Share Session Anaheim Laura Knapp WW Business Consultant Laurak@aesclever.com 07/27/2012 Applied Expert Systems, Inc. 2012 1 The Past What network protocols did you run before 1990?

More information

Cutting the Cord: A Robust Wireless Facilities Network for Data Centers

Cutting the Cord: A Robust Wireless Facilities Network for Data Centers Cutting the Cord: A Robust Wireless Facilities Network for Data Centers Yibo Zhu, Xia Zhou, Zengbin Zhang, Lin Zhou, Amin Vahdat, Ben Y. Zhao and Haitao Zheng U.C. Santa Barbara, Dartmouth College, U.C.

More information

Configuring Cisco IOS IP SLAs Operations

Configuring Cisco IOS IP SLAs Operations CHAPTER 50 This chapter describes how to use Cisco IOS IP Service Level Agreements (SLAs) on the switch. Cisco IP SLAs is a part of Cisco IOS software that allows Cisco customers to analyze IP service

More information

PathMon: Path-Specific Traffic Monitoring in OpenFlow-Enabled Networks

PathMon: Path-Specific Traffic Monitoring in OpenFlow-Enabled Networks PathMon: Path-Specific Traffic Monitoring in OpenFlow-Enabled Networks Ming-Hung Wang, Shao-You Wu, Li-Hsing Yen, and Chien-Chao Tseng Dept. Computer Science, National Chiao Tung University Hsinchu, Taiwan,

More information

RDMA over Commodity Ethernet at Scale

RDMA over Commodity Ethernet at Scale RDMA over Commodity Ethernet at Scale Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitendra Padhye, Marina Lipshteyn ACM SIGCOMM 2016 August 24 2016 Outline RDMA/RoCEv2 background DSCP-based

More information

Configuring Cisco IOS IP SLA Operations

Configuring Cisco IOS IP SLA Operations CHAPTER 58 This chapter describes how to use Cisco IOS IP Service Level Agreements (SLA) on the switch. Cisco IP SLA is a part of Cisco IOS software that allows Cisco customers to analyze IP service levels

More information

Quickly Pinpoint and Resolve Problems in Windows /.NET Applications TECHNICAL WHITE PAPER

Quickly Pinpoint and Resolve Problems in Windows /.NET Applications TECHNICAL WHITE PAPER Quickly Pinpoint and Resolve Problems in Windows /.NET Applications TECHNICAL WHITE PAPER Table of Contents Executive Overview...1 Problem Resolution A Major Time Consumer...2 > Inefficiencies of the Problem

More information

DevoFlow: Scaling Flow Management for High-Performance Networks

DevoFlow: Scaling Flow Management for High-Performance Networks DevoFlow: Scaling Flow Management for High-Performance Networks Andy Curtis Jeff Mogul Jean Tourrilhes Praveen Yalagandula Puneet Sharma Sujata Banerjee Software-defined networking Software-defined networking

More information

Configuring StackWise Virtual

Configuring StackWise Virtual Finding Feature Information, page 1 Restrictions for Cisco StackWise Virtual, page 1 Prerequisites for Cisco StackWise Virtual, page 2 Information About Cisco Stackwise Virtual, page 2 Cisco StackWise

More information

Towards Predictable + Resilient Multi-Tenant Data Centers

Towards Predictable + Resilient Multi-Tenant Data Centers Towards Predictable + Resilient Multi-Tenant Data Centers Presenter: Ali Musa Iftikhar (Tufts University) in joint collaboration with: Fahad Dogar (Tufts), {Ihsan Qazi, Zartash Uzmi, Saad Ismail, Gohar

More information

KEMP 360 Vision. KEMP 360 Vision. Product Overview

KEMP 360 Vision. KEMP 360 Vision. Product Overview KEMP 360 Vision Product Overview VERSION: 1.0 UPDATED: SEPTEMBER 2016 Table of Contents 1 Introduction... 3 1.1 Document Purpose... 3 1.2 Intended Audience... 3 2 Architecture... 4 3 Sample Scenarios...

More information

Network Survivability

Network Survivability Network Survivability Bernard Cousin Outline Introduction to Network Survivability Types of Network Failures Reliability Requirements and Schemes Principles of Network Recovery Performance of Recovery

More information

Managed WAN SLA. Contents

Managed WAN SLA. Contents Managed WAN SLA Contents Terminology... 2 Service Description... 2 General... 2 Levels and Offerings... 2 Private Network Services... 2 Features... 2 Internet Access... 3 Features... 3 Service Level Metrics...

More information

To Filter or to Authorize: Network-Layer DoS Defense against Multimillion-node Botnets. Xiaowei Yang Duke Unversity

To Filter or to Authorize: Network-Layer DoS Defense against Multimillion-node Botnets. Xiaowei Yang Duke Unversity To Filter or to Authorize: Network-Layer DoS Defense against Multimillion-node Botnets Xiaowei Yang Duke Unversity Denial of Service (DoS) flooding attacks Send packet floods to a targeted victim Exhaust

More information

SWAN: Software-driven wide area network. Ratul Mahajan

SWAN: Software-driven wide area network. Ratul Mahajan SWAN: Software-driven wide area network Ratul Mahajan Partners in crime Vijay Gill Chi-Yao Hong Srikanth Kandula Ratul Mahajan Mohan Nanduri Ming Zhang Roger Wattenhofer Rohan Gandhi Xin Jin Harry Liu

More information

Managed WAN SLA. Contents

Managed WAN SLA. Contents Managed WAN SLA Contents Terminology... 2 Service Description... 2 Service Offerings... 2 Private Network Services... 2 Ethernet Connectivity... 2 T-1 Connectivity... 3 Other Connectivity... 3 Internet

More information

Service Recovery & Availability. Robert Dickerson June 2010

Service Recovery & Availability. Robert Dickerson June 2010 Service Recovery & Availability Robert Dickerson June 2010 Started in 1971 with $3,000, 40 clients and 1 employee. 2009: over $2B revenue, 500,000+ clients, 13,000 employees. Payroll / Tax Services / 401(k)

More information

BGP#: A System for Dynamic Route Control In Data Centers

BGP#: A System for Dynamic Route Control In Data Centers BGP#: A System for Dynamic Route Control In Data Centers Chao-Chih Chen UC Davis* Lihua Yuan Albert Greenberg Randy Kern Tao Zhang Parantap Lahiri John Arnold Kevin Grady Microsoft *Also a Microsoft Intern

More information

Detailed diagnosis in enterprise networks. Network diagnosis

Detailed diagnosis in enterprise networks. Network diagnosis Detailed diagnosis in enterprise networks Srikanth Kandula, Ratul Mahajan, Patrick Verkaik (UCSD), Sharad Agarwal, Jitu Padhye, Victor Bahl Network diagnosis Explaining faulty behavior 1 Current landscape

More information

Protecting remote site data SvSAN clustering - failure scenarios

Protecting remote site data SvSAN clustering - failure scenarios White paper Protecting remote site data SvSN clustering - failure scenarios Service availability and data integrity are key metrics for enterprises that run business critical applications at multiple remote

More information

Testing Video over IP Product and Services

Testing Video over IP Product and Services GIGANET S Y S T E M S Precision Performance Repeatability Testing Video over IP Product and Services Application Note Introduction Video over IP has gone mainstream. Over the last few years, the number

More information

High Availability and Disaster Recovery Solutions for Perforce

High Availability and Disaster Recovery Solutions for Perforce High Availability and Disaster Recovery Solutions for Perforce This paper provides strategies for achieving high Perforce server availability and minimizing data loss in the event of a disaster. Perforce

More information
