Lessons Learned Operating Active/Active Data Centers Ethan Banks, CCIE

Lessons Learned Operating Active/Active Data Centers Ethan Banks, CCIE #20655 @ecbanks Senior Network Architect, Carenection Co-founder, Packet Pushers Interactive http://ethancbanks.com http://packetpushers.net

Who is Ethan Banks? Senior Network Architect @ Carenection, CCIE #20655. Podcaster @ PacketPushers.net. Writer @ NetworkComputing.com & EthanCBanks.com. Infrastructure track chair @ Interop. ethan.banks@packetpushers.net @ecbanks

Defining Active/Active for this session.
A/A is not disaster recovery, but rather disaster avoidance. A/A allows you to serve your application from two or more locations in real time or near real time. One site might be preferred over another, but synchronization of the storage and database tiers is maintained between sites.
In this session, we will consider a web application in a dual DC A/A design where DNS is used for load balancing and failure recovery.

Reference Diagram: DC Green, DC Blue, Internet, Customers.

DNS TTL is sometimes ignored.
One strategy for switching an app to a different data center is low TTLs.
PROBLEM: Not all clients will honor the TTL.
RESULT: Some clients are stuck to the old data center.
MITIGATION: Understand client behavior, and recommend best practices to customers or developers.
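One way to reason about stragglers is to treat each client as caching an answer for the larger of the record TTL and whatever minimum that client enforces. The sketch below is only an illustration of that idea; the function names and the per-client "floor" values are hypothetical, not measurements.

```python
def effective_ttl(record_ttl_s, client_floor_s=0):
    """Seconds a client may keep using a cached answer.

    Assumption: a client that ignores low TTLs behaves as if it
    enforced its own minimum cache time (client_floor_s).
    """
    return max(record_ttl_s, client_floor_s)


def clients_still_on_old_dc(record_ttl_s, client_floors_s, elapsed_s):
    """Count clients whose cached answer could still point at the old DC."""
    return sum(
        1 for floor in client_floors_s
        if effective_ttl(record_ttl_s, floor) > elapsed_s
    )


# Hypothetical population: two well-behaved clients, two that pin answers.
floors = [0, 0, 300, 3600]
stuck = clients_still_on_old_dc(record_ttl_s=30, client_floors_s=floors, elapsed_s=60)
```

With a 30-second TTL, the two well-behaved clients have moved after a minute, while the two that pin answers have not.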

BGP announcements take time.
BGP at the Internet edge is used to announce public address space to the Internet.
PROBLEM: During a failover event, BGP convergence can take several minutes to complete. Note that carriers filter your announcements, so you can't announce new prefixes on the fly.
RESULT: Some clients head to the wrong location during convergence.
MITIGATION: Announce all routes from all locations, but use AS-path prepending or other metrics to influence inbound traffic. Work with your carrier. You end up with faster convergence.
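The effect of prepending can be sketched as a path comparison. This is a simplified model, not a BGP implementation: it assumes every earlier tie-breaker (such as local preference) is equal, so AS-path length decides. The site names and AS numbers are made up for illustration.

```python
def best_path(announcements):
    """Return the announcement with the shortest AS path.

    Assumption: all earlier best-path tie-breakers are equal,
    so AS-path length is the deciding step.
    """
    return min(announcements, key=lambda a: len(a["as_path"]))


# Hypothetical view from a remote network; ASNs are illustrative only.
announcements = [
    {"site": "green", "as_path": [64496, 64500]},
    {"site": "blue", "as_path": [64497, 64500, 64500, 64500]},  # prepended twice
]

preferred = best_path(announcements)

# If DC Green's announcement is withdrawn, the Blue path is already known.
after_failure = best_path([a for a in announcements if a["site"] != "green"])
```

Because the prefix is already announced from both sites, a failure of the preferred site leaves a known path in place instead of requiring a brand-new announcement.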

Firewalls need symmetric traffic flows.
Stateful firewalls are used at the Internet edge and between DMZ/trusted networks.
PROBLEM: Asymmetric traffic breaks stateful inspection, causing flow termination.
RESULT: Broken client connectivity. Sessions drop.
MITIGATION: Enforce the path. Use proxies or NAT (!). Alternatively, mirror firewall state tables between data centers (hard, unreliable).
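A toy model shows why asymmetry breaks stateful inspection: the firewall that saw the connection open holds the state, and a firewall in the other data center does not. The class and the addresses below are hypothetical and deliberately minimal.

```python
class StatefulFirewall:
    """Minimal state table keyed by (source, destination) endpoints."""

    def __init__(self, name):
        self.name = name
        self.flows = set()

    def new_connection(self, src, dst):
        """Record state for a permitted connection opener (e.g. a TCP SYN)."""
        self.flows.add((src, dst))

    def permits_reply(self, src, dst):
        """A reply passes only if this firewall saw the opener of the flow."""
        return (dst, src) in self.flows


green = StatefulFirewall("DC Green")
blue = StatefulFirewall("DC Blue")

client = ("203.0.113.10", 51000)   # documentation-range addresses
server = ("198.51.100.20", 443)

green.new_connection(client, server)               # request enters via Green
symmetric_ok = green.permits_reply(server, client)  # reply leaves via Green
asymmetric_ok = blue.permits_reply(server, client)  # reply leaves via Blue
```

The reply through DC Blue is refused because Blue never saw the opening packet, which is the session drop described above.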

Dual DC firewalls need identical policies.
In an active/active design, the assumption is that firewall clusters at each DC maintain an identical security policy.
PROBLEM: Policy sharing is not assumed by firewall vendors, and must be explicitly configured.
RESULT: Traffic flows that are permitted in one DC could be denied in the other, and vice versa.
MITIGATION: Replicate policies on same-tier firewalls.
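Policy drift between same-tier firewalls can be caught with a simple comparison. The sketch below assumes rules have already been exported into tuples of (action, source, destination, port); that format and the sample rules are hypothetical. It also ignores rule ordering, which matters on real firewalls.

```python
def policy_drift(green_rules, blue_rules):
    """Return the rules present on one firewall but missing from the other.

    Assumption: rule order is ignored; only the set of rules is compared.
    """
    green, blue = set(green_rules), set(blue_rules)
    return {
        "only_green": sorted(green - blue),
        "only_blue": sorted(blue - green),
    }


# Hypothetical exported policies: (action, source, destination, port).
green_rules = [
    ("permit", "any", "198.51.100.20", 443),
    ("permit", "10.1.0.0/16", "10.2.0.5", 5432),
]
blue_rules = [
    ("permit", "any", "198.51.100.20", 443),
]

drift = policy_drift(green_rules, blue_rules)
```

A non-empty result is a flow that would be permitted in one DC and denied in the other.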

ADCs need smart checks to fail over.
Application delivery controllers (ADCs) use health checks to determine the availability of an application.
PROBLEM: Insufficient checks mean that ADCs can't tell when an application is no longer available.
RESULT: The ADC does not swing traffic to working pool members, resulting in failed client sessions.
MITIGATION: Write multi-level L7 health checks that verify an application is completely available. Think authentication, database access, web server, SSL, etc.
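A multi-level check can be sketched as a set of independent probes that must all pass. This is not any vendor's health monitor syntax; the function name is hypothetical and the probes are stand-in callables that a real check would replace with an HTTPS request, a login attempt, a database query, and so on.

```python
def app_is_healthy(checks):
    """Run every probe; the app is healthy only if all of them pass.

    `checks` maps a layer name to a zero-argument callable returning
    True/False. A probe that raises is treated as a failure.
    """
    failed = []
    for name, probe in checks.items():
        try:
            ok = bool(probe())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return (not failed, failed)


# Stand-in probes for illustration only.
checks = {
    "web_server": lambda: True,
    "ssl": lambda: True,
    "authentication": lambda: True,
    "database": lambda: False,  # simulate a failed database query
}

healthy, failed = app_is_healthy(checks)
```

A port-only check would have reported this pool member as up; the layered check marks it down because the database probe fails.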

ADCs need symmetric traffic flows.
ADCs from A10 and F5 are full TCP proxies, and require that traffic flowing through them is symmetrical.
PROBLEM: When failing over to nodes in the opposite data center, return traffic will stay in the opposite DC.
RESULT: Broken client sessions during a partial failover event.
MITIGATION: NAT to an ADC address, tunnel to the opposite data center, and static route it back to the origin data center.

Latency is a limitation.
Acknowledged traffic is limited in throughput by latency. It's math. http://bradhedlund.com/2008/12/19/how-to-calculate-tcp-throughput-for-long-distance-links/
PROBLEM: Waiting around for "I got it" slows down the transfer. This impacts real-time transactions as well as synchronous storage replication.
RESULT: Inter-DC transactions take more time than intra-DC transactions. This can have a cumulative effect.
MITIGATION: WAN optimization. Optimize TCP stacks. Don't architect real-time expectations into long-distance geography.
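The math is the window-over-round-trip bound: a sender can have at most one window of unacknowledged data in flight per round trip. A minimal sketch follows, where the window size and round-trip times are example values rather than figures from the talk.

```python
def max_tcp_throughput_bps(window_bytes, rtt_s):
    """Upper bound on single-flow TCP throughput: one window per round trip."""
    return window_bytes * 8 / rtt_s


# Example values: a 64 KB window without window scaling.
window = 65535
intra_dc = max_tcp_throughput_bps(window, 0.001)  # 1 ms round trip
inter_dc = max_tcp_throughput_bps(window, 0.080)  # 80 ms round trip
```

With the same window, the 80 ms path is capped at roughly 6.5 Mbit/s per flow regardless of link speed, while the 1 ms path allows roughly 80 times that.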

Synchronization can fall behind.
Data centers aren't active/active if data sets (storage, database) are not synchronized between sites.
PROBLEM: Insufficient bandwidth, high latency, or overly large data sets can exceed synchronization windows. DCs become unsynchronized, possibly falling further behind over time.
RESULT: Alternate DCs are not able to process traffic when needed.
MITIGATION: More bandwidth. Lower latency. WAN optimization.
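Falling behind is a rate problem: if data changes faster than the link can replicate it, the backlog grows for as long as that holds. The sketch below assumes constant rates, which real workloads do not have; the function name and the figures are illustrative only.

```python
def replication_backlog_bytes(change_rate_bps, link_rate_bps, seconds, start_bytes=0.0):
    """Unreplicated data after `seconds`, assuming constant rates.

    The backlog grows when data changes faster than the link can carry it,
    and drains (never below zero) when the link has headroom.
    """
    growth = (change_rate_bps - link_rate_bps) / 8 * seconds
    return max(0.0, start_bytes + growth)


# Illustrative figures: 1.2 Gbit/s of changes over a 1 Gbit/s replication path.
behind_after_hour = replication_backlog_bytes(1_200_000_000, 1_000_000_000, 3600)

# With headroom (0.8 Gbit/s of changes), the same link stays caught up.
caught_up = replication_backlog_bytes(800_000_000, 1_000_000_000, 3600)
```

In the first case the alternate DC is 90 GB behind after one hour and keeps slipping, which is why more bandwidth or less data to move is the mitigation.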

Broken inter-DC links can break active/active.
Often, there's a data processing dependency on the inter-DC link.
PROBLEM: A/A DC applications often rely on a spider web of data source interactions to do what they do. Think multiple database queries, authentication mechanisms, storage arrays, logging, etc.
RESULT: If communication between DCs fails, an application dependency breaks, and the app itself can fail wholly or in part.
MITIGATION: Isolate application dependencies to a single DC. Make DC-DC links redundant in all ways; plan path diversity with your carrier. Know your routing plan well. Don't forget that a tunnel over a public path is a viable backup.

Big, noisy flows squash delicate, sensitive flows.
Big flows tend to fill pipes, especially with WAN optimization. Data synchronization tends to fill inter-DC pipes.
PROBLEM: Latency and jitter for VoIP and other time-sensitive flows, such as interactive traffic, increase as links approach full. Don't forget about microbursts.
RESULT: End-user application experience is lousy.
MITIGATION: QoS using priority (low-latency) queues. LLQs reserve bandwidth during times of congestion and dequeue traffic at a regular time interval. Note that sometimes it's okay to put non-voice traffic in an LLQ.
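The core of a priority queue can be sketched as "always serve the priority queue first". This is a simplified scheduler, not a router's LLQ: the class name is hypothetical, and the cap that real implementations typically place on the priority queue so it cannot starve other traffic is left out.

```python
from collections import deque


class LowLatencyScheduler:
    """Two queues: priority traffic is always dequeued before default traffic."""

    def __init__(self):
        self.priority = deque()
        self.default = deque()

    def enqueue(self, packet, priority=False):
        (self.priority if priority else self.default).append(packet)

    def dequeue(self):
        """Return the next packet to send, or None if both queues are empty."""
        if self.priority:
            return self.priority.popleft()
        if self.default:
            return self.default.popleft()
        return None


sched = LowLatencyScheduler()
sched.enqueue("sync-1")
sched.enqueue("voice-1", priority=True)
sched.enqueue("sync-2")
sched.enqueue("voice-2", priority=True)

send_order = [sched.dequeue() for _ in range(4)]
```

The voice packets leave first even though bulk synchronization packets arrived ahead of them, which is what keeps their latency and jitter low on a congested link.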

Untested failovers don't fail over.
We build redundant active/active data centers as a disaster avoidance mechanism.
PROBLEM: Applications are complicated and evolve over time.
RESULT: A DC has a failure and all failover processes work as expected, but the application itself does not work when the other DC becomes primary.
MITIGATION: Test application failover regularly. Quarterly is a good interval. Audit processes with the infrastructure team. Don't forget load testing.

Fate sharing is a dual DC design nightmare.
Part of a dual DC design is ensuring that something bad happening in one DC doesn't happen in the other.
PROBLEM: Improper L2 extension pushes issues such as bridging loops or broadcast storms from one DC to another.
RESULT: Both DCs are unable to process traffic. Users lose access to applications.
MITIGATION: Maintain separate L2 domains using tools like HP EVI or Cisco OTV. Alternatively, do not stretch VLANs.

To the future!
HTTP/2
Cross-site state synchronization
EVPN

Thanks & stay in touch! ethan.banks@packetpushers.net @ecbanks EthanCBanks.com PacketPushers.net NetworkComputing.com