NetPilot: Automating Datacenter Network Failure Mitigation
1 NetPilot: Automating Datacenter Network Failure Mitigation
Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang
2 Failures are Common and Harmful
- Network failures are common: 10,000+ switches
3 Failures are Common and Harmful
- Network failures are common
- Failures cause long down times
4 Failures are Common and Harmful
- Six-month failure logs of production datacenters
- Network failures are common
- Failures cause long down times: 25% of failures take 13+ hours to repair
- [Figure: time from detection to repair (minutes)]
5 Failures are Common and Harmful
- Failures are common due to VERY large datacenters
- Failures cause long down times
- Long failure duration means large revenue loss
7 How to Shorten Failure Recovery Time?
8 Previous Work
- Conventional failure recovery takes 3 steps: Detection, Diagnosis, Repair
- Diagram labels: passive, ping, active
9 Previous Work
- Conventional failure recovery takes 3 steps: Detection, Diagnosis, Repair
- Failure localization/diagnosis: [M. K. Aguilera, SOSP '03] [M. Y. Chen, NSDI '04] [R. R. Kompella, NSDI '05] [P. Bahl, SIGCOMM '07] [S. Kandula, SIGCOMM '09]
10 Automating Failure Diagnosis is Challenging
- Root causes are deep in the network stack
- Diagnosis involves multiple parties
11-13 Six-month failure logs from several production DCNs

Category             Failure type                 Diagnosis & repair     %
Software (21%)       Link layer loop              Find and fix bugs      19%
                     Imbalance overload                                  2%
Hardware (18%)       FCS error                    Replace cable          13%
                     Unstable power               Repair power           5%
Unknown (23%)        Switch stops forwarding      N/A                    9%
                     Imbalance overload                                  7%
                     Lost configuration                                  5%
                     High CPU utilization                                2%
Configuration (38%)  Errors on multiple switches  Update configuration   32%
                     Errors on one switch                                6%

- 1. Root causes are deep in the network stack
- 2. Diagnosis involves multiple parties
- Failure Diagnosis Requires Human Intervention!
14 Can we do something other than failure diagnosis?
15 NetPilot: Mitigating rather than Diagnosing Failures
- Mitigate failure symptoms ASAP, at the cost of reduced capacity
- Diagram labels: Detection, Diagnosis, Repair
17 NetPilot Benefits
- Short recovery time
- Small network disruption
- Low operation cost
- Diagram labels: Detection, Diagnosis, Repair, and automated Mitigation
18 Failure Mitigation is Effective
- Most failures can be mitigated by simple actions
- Mitigation is feasible due to redundancy
19-21 Failure types, mitigation, and repair

Category             Failure type                  Mitigation         Repair                %
Software (21%)       Link layer loop               Deactivate port    Find and fix bugs     19%
                     Imbalance-triggered overload  Restart switch                           2%
Hardware (18%)       FCS error                     Deactivate port    Replace cable         13%
                     Unstable power                Deactivate switch  Repair power          5%
Unknown (23%)        Switch stops forwarding       Restart switch     N/A                   9%
                     Imbalance-triggered overload  Restart switch                           7%
                     Lost configuration            Restart switch                           5%
                     High CPU utilization          Restart switch                           2%
Configuration (38%)  Errors on multiple switches   n/a                Update configuration  32%
                     Errors on single switch       Deactivate switch                        6%

- 68% of failures can be mitigated by simple actions
24 Outline
- Automating failure diagnosis is challenging
- Failure mitigation is effective
- How to automate mitigation?
- NetPilot evaluations
- Conclusion
25 A Strawman NetPilot: Trial-and-error
- Flowchart: Network failure, then Localization, then Execute an action, then the check "Failure mitigated?"
- If yes: End
- If no: Roll back if necessary, then execute another action
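The strawman loop on this slide can be sketched in a few lines of Python. This is only an illustration, not NetPilot's code: the function name and the `execute`, `is_mitigated`, and `roll_back` callbacks are hypothetical stand-ins for whatever a real deployment would use, and the sketch always rolls back a failed action, whereas the slide says "if necessary".

```python
from typing import Callable, Iterable, Optional


def mitigate_by_trial_and_error(
    candidate_actions: Iterable[str],
    execute: Callable[[str], None],
    is_mitigated: Callable[[], bool],
    roll_back: Callable[[str], None],
) -> Optional[str]:
    """Try candidate actions one at a time until the symptoms disappear.

    All four parameters are hypothetical hooks introduced for this sketch.
    """
    for action in candidate_actions:
        execute(action)
        if is_mitigated():
            return action   # "Failure mitigated? Yes" branch: end
        roll_back(action)   # "No" branch: undo, then try the next action
    return None             # no candidate action mitigated the failure


# Toy usage: the failure disappears only once port 7 is deactivated.
state = {"active": set()}
fix = "deactivate port 7"
result = mitigate_by_trial_and_error(
    ["restart switch s1", "deactivate port 3", fix],
    execute=lambda a: state["active"].add(a),
    is_mitigated=lambda: fix in state["active"],
    roll_back=lambda a: state["active"].discard(a),
)
print(result)
```

In this toy run two actions are executed and rolled back before the third one works, which is the cost that the following slides set out to reduce.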
26-27 NetPilot: Challenges & Solutions
- Flowchart: Network failure, Localization, Execute an action, Failure mitigated? (Yes: End; No: Roll back if necessary)
- Challenge 1: Blind trial-and-error takes a long time
- Solution: Failure specific localization
28 NetPilot: Challenges & Solutions
- Flowchart: Network failure, Localization, Estimate impact, Execute an action, Failure mitigated? (Yes: End; No: Roll back if necessary)
- Challenge 2: Partition/overload network
- Solution: Impact estimation
31 NetPilot: Challenges & Solutions
- Flowchart: Network failure, Localization, Estimate impact, Rank actions, Execute an action, Failure mitigated? (Yes: End; No: Roll back if necessary)
- Challenge 3: Different actions have different side-effects
- Solution: Rank actions based on impact
32 Failure Specific Localization
- Limited # of failure types
- Domain knowledge improves accuracy
- Failure types:
  1. Link layer loop
  2. Imbalance-triggered overload
  3. FCS error
  4. Unstable power
  5. Switch stops forwarding
  6. Imbalance-triggered overload
  7. Lost configuration
  8. High CPU utilization
  9. Errors on multiple switches
  10. Errors on single switch
33 Example: Frame Check Sequence (FCS) Errors
- 13% of all the failures
- Cut-through switching: forward frames before checksums are verified
- Increase application latency
34 Localizing FCS Errors
- For each link L: error frames seen on L = frames corrupted by L + frames corrupted by other links that traverse L
- x_L: corruption rate of link L
- # of variables = # of equations = # of links
- Corrupted links: x_L > 0
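To make the "one equation per link" idea concrete, here is a minimal Python sketch under a simplified model that is an assumption of this example, not the exact formulation behind the slide: the error frames observed on link L equal `frames[L] * x[L]` plus, for every other link K, `x[K]` times the number of frames that traverse K and then L. The link names, traffic counts, the plain Gaussian-elimination solver, and the `1e-6` threshold are all illustrative.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def localize_corrupted_links(links, frames, shared, errors, threshold=1e-6):
    """Return links whose estimated corruption rate x_L exceeds the threshold.

    frames[L]      : frames carried by link L
    shared[(K, L)] : frames that traverse K and then L (0 if absent)
    errors[L]      : error frames observed on L
    """
    a = [[frames[l] if k == l else shared.get((k, l), 0.0) for k in links]
         for l in links]
    x = solve_linear(a, [errors[l] for l in links])
    return [l for l, rate in zip(links, x) if rate > threshold]


# Toy example: only link "A" corrupts frames (1% rate); B and C merely
# forward frames that A already corrupted.
links = ["A", "B", "C"]
frames = {"A": 1000.0, "B": 1000.0, "C": 1000.0}
shared = {("A", "B"): 500.0, ("B", "C"): 400.0, ("A", "C"): 200.0}
errors = {"A": 10.0, "B": 5.0, "C": 2.0}
corrupted = localize_corrupted_links(links, frames, shared, errors)
print(corrupted)
```

Although all three links report error frames, solving the system attributes them to link A alone, which is the point of the "corrupted links: x_L > 0" test.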
35 NetPilot Overview
- Flowchart: Network failure, Localization, Estimate impact, Rank actions, Execute an action, Failure mitigated? (Yes: End; No: Roll back if necessary)
36 Impact Metrics
- Derived from Service Level Agreement (SLA)
- Availability: online_server_ratio
- Packet loss: total_lost_pkt
- Latency: max_link_utilization (small link utilization means small queuing delay)
- total_lost_pkt and max_link_utilization are derived from the utilization of individual links
37 Estimating Link Utilization
- Diagram: Action, Traffic, and Topology feed the Impact Estimator, which outputs Link utilization
- # of flows >> redundant paths
- Traffic evenly distributed under ECMP
- Estimate the load contributed by each flow on each link
- Sum up the loads to compute utilization
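The two estimation steps above (per-flow load on each link, then a sum per link) can be sketched as follows. The sketch assumes that "evenly distributed under ECMP" can be approximated by splitting each flow's rate equally across its surviving equal-cost paths, which is a simplification; the function name, the data layout, and the `deactivated` parameter used to model a mitigation action are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Link = Tuple[str, str]
Path = List[Link]


def estimate_link_utilization(
    flows: List[Tuple[str, str, float]],         # (src, dst, rate)
    paths: Dict[Tuple[str, str], List[Path]],    # equal-cost paths per pair
    capacity: Dict[Link, float],                 # same unit as flow rates
    deactivated: FrozenSet[Link] = frozenset(),  # links removed by an action
) -> Dict[Link, float]:
    load = defaultdict(float)
    for src, dst, rate in flows:
        usable = [p for p in paths[(src, dst)]
                  if not any(link in deactivated for link in p)]
        if not usable:
            continue  # pair is partitioned; this sketch does not track that
        share = rate / len(usable)  # even split across equal-cost paths
        for path in usable:
            for link in path:
                load[link] += share  # sum per-flow contributions
    return {link: load[link] / cap
            for link, cap in capacity.items() if link not in deactivated}


# Toy topology: tor1 reaches tor2 through agg_a or agg_b.
via_a = [("tor1", "agg_a"), ("agg_a", "tor2")]
via_b = [("tor1", "agg_b"), ("agg_b", "tor2")]
paths = {("tor1", "tor2"): [via_a, via_b]}
capacity = {link: 10.0 for link in via_a + via_b}
flows = [("tor1", "tor2", 4.0), ("tor1", "tor2", 2.0)]

before = estimate_link_utilization(flows, paths, capacity)
after = estimate_link_utilization(
    flows, paths, capacity, deactivated=frozenset({("tor1", "agg_a")}))
print(max(before.values()), max(after.values()))
```

Comparing `before` and `after` is how an impact estimator could predict max_link_utilization for a candidate action without actually executing it.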
38 Link Utilization Estimation is Highly Accurate
- 1-month traffic from an 8000-server network
- Log socket events on each server
- Ground truth: SNMP counters
39 NetPilot Overview
- Flowchart: Network failure, Localization, Estimate impact, Rank actions, Execute an action, Failure mitigated? (Yes: End; No: Roll back if necessary)
- Rank actions: choose the action with the least impact
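A minimal sketch of "choose the action with the least impact", using the three metric names from the Impact Metrics slide. The slides do not say how the metrics are combined, so the lexicographic order used here (availability first, then packet loss, then maximum link utilization) is purely an assumption of this example, as are the action names and numbers.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Impact:
    online_server_ratio: float   # availability: higher is better
    total_lost_pkt: float        # packet loss: lower is better
    max_link_utilization: float  # latency proxy: lower is better


def rank_actions(estimates: Dict[str, Impact]) -> List[str]:
    """Order candidate actions from least to most estimated impact.

    Assumed ordering (not from the slides): compare availability first,
    then packet loss, then maximum link utilization.
    """
    def key(name: str):
        i = estimates[name]
        return (-i.online_server_ratio, i.total_lost_pkt, i.max_link_utilization)
    return sorted(estimates, key=key)


# Hypothetical estimates for three candidate actions.
estimates = {
    "deactivate link core1-agg": Impact(1.0, 0.0, 0.62),
    "deactivate link core2-agg": Impact(1.0, 0.0, 0.48),
    "deactivate switch agg": Impact(0.95, 0.0, 0.40),
}
ranked = rank_actions(estimates)
print(ranked[0])
```

The first element of `ranked` is the action that would be executed first; if it does not mitigate the failure, the loop falls back to the next one.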
40 Outline
- Automating failure diagnosis is challenging
- Failure mitigation is effective
- How to automate mitigation?
  - Localization, impact estimation, ranking
- NetPilot evaluations
  - Mitigating load imbalance
  - Mitigating FCS errors
  - Mitigating overload
- Conclusion
41 Load Imbalance
- Agg a stops receiving traffic
- Localize to 4 suspects
- [Diagram: core a, core b, Agg a, Agg b]
48-49 Fast FCS Error Mitigation
- Human operator: after 11 trials in 3.5 hours, 2 out of 28 ports are deactivated
- NetPilot: deactivates 2 links in 1 trial within 15 minutes
- 3.5 hours vs. 15 minutes
52 Mitigating Link Overload
- Mitigate overload by deactivating healthy links
- Many candidate links in production networks
- Choose the link(s) with the least impact
- [Diagram: three topologies, each with core 1, core 2, and agg; remaining label text: "3 lost"]
53 Action Ranking Lowers Link Utilization
- Replay 97 overload incidents due to link failures
54 Conclusion
- Mitigation reduces failure recovery time
  - Simple actions are effective
  - Made possible by redundancy
- NetPilot: automating failure mitigation
  - Recovery time: from hours to minutes
  - Several mitigation scenarios deployed in Bing
55 Thank You!
- NetPilot diagram labels: Detection, Diagnosis, Repair, and automated Mitigation
57 NetPilot Shortens Recovery Time
- Time from detection to mitigation
- 6 months, many production datacenters
- NetPilot mitigates 3 types of failures, all within 30 minutes
- Operators work around 50% of failures in 2 HOURS
More informationGamma Service Incident Report Final 18/9/14
Gamma Service Report Final 18/9/14 Broadband Service Please read the following as it could have an impact on some of your customers. Reference: Start Date: Start Time: Actual Clear Date: Actual Clear Time:
More informationResilience Validation
Resilience Validation Ramkumar Natarajan & Manager - NFT Cognizant Technology Solutions Abstract With the greater complexities in application and infrastructure landscapes, the risk of failure is ever
More informationTCP WISE: One Initial Congestion Window Is Not Enough
TCP WISE: One Initial Congestion Window Is Not Enough Xiaohui Nie $, Youjian Zhao $, Guo Chen, Kaixin Sui, Yazheng Chen $, Dan Pei $, MiaoZhang, Jiyang Zhang $ 1 Motivation Web latency matters! latency
More informationWindows Azure Services - At Different Levels
Windows Azure Windows Azure Services - At Different Levels SaaS eg : MS Office 365 Paas eg : Azure SQL Database, Azure websites, Azure Content Delivery Network (CDN), Azure BizTalk Services, and Azure
More informationDetection and Localization of Network Black Holes
Detection and Localization of Network Black Holes Ramana Rao Kompella, Jennifer Yates, Albert Greenberg, and Alex C. Snoeren University of California, San Diego, AT&T Labs Research, Micorsoft Research
More informationAuthors: Rupa Krishnan, Harsha V. Madhyastha, Sridhar Srinivasan, Sushant Jain, Arvind Krishnamurthy, Thomas Anderson, Jie Gao
Title: Moving Beyond End-to-End Path Information to Optimize CDN Performance Authors: Rupa Krishnan, Harsha V. Madhyastha, Sridhar Srinivasan, Sushant Jain, Arvind Krishnamurthy, Thomas Anderson, Jie Gao
More informationConfiguring Cisco StackWise Virtual
Finding Feature Information, page 1 Restrictions for Cisco StackWise Virtual, page 1 Prerequisites for Cisco StackWise Virtual, page 3 Information About Cisco Stackwise Virtual, page 3 Cisco StackWise
More informationMonitor Qlik Sense sites. Qlik Sense Copyright QlikTech International AB. All rights reserved.
Monitor Qlik Sense sites Qlik Sense 2.1.2 Copyright 1993-2015 QlikTech International AB. All rights reserved. Copyright 1993-2015 QlikTech International AB. All rights reserved. Qlik, QlikTech, Qlik Sense,
More informationENSURING SCADA NETWORK CONTINUITY WITH ROUTING AND TRAFFIC ANALYTICS
ENSURING SCADA NETWORK CONTINUITY WITH ROUTING AND TRAFFIC ANALYTICS WHITE PAPER Table of Contents The Mandate of Utility Grid Uptime 3 Matching Information Network Reliability to Utility Grid Reliability
More informationService Level Agreement
Service Level Agreement Version 2018.1 Copyright 2018 Aldridge PO Box 56506, Houston, TX 77256-6506 713.403.9150 http://aldridge.com Contents Contents... 2 Agreement... 3 The Aggregate Set of Agreements
More informationIntelligent Application Bypass
The following topics describe how to configure access control polices to use (IAB) Introduction to IAB, on page 1 IAB Options, on page 2 Configuring IAB, on page 4 IAB Logging and Analysis, on page 5 Introduction
More informationMinimizing Churn in Distributed Systems
Minimizing Churn in Distributed Systems by P. Brighten Godfrey, Scott Shenker, and Ion Stoica appearing in SIGCOMM 2006 presented by Todd Sproull Introduction Problem: nodes joining or leaving distributed
More informationPIE in the Sky : Online Passive Interference Estimation for Enterprise WLANs
WiNGS Labs PIE in the Sky : Online Passive Interference Estimation for Enterprise WLANs * Nokia Research Center, Palo Alto Shravan Rayanchu, Suman Banerjee University of Wisconsin-Madison Konstantina Papagiannaki
More information10 BEST PRACTICES TO STREAMLINE NETWORK MONITORING. By: Vinod Mohan
10 BEST PRACTICES TO STREAMLINE NETWORK MONITORING By: Vinod Mohan 10 Best Practices to Streamline Network Monitoring Introduction As a network admin, you are tasked with keeping your organization s network
More information