Lessons Learned Operating Active/Active Data Centers Ethan Banks, CCIE #20655 @ecbanks Senior Network Architect, Carenection Co-founder, Packet Pushers Interactive http://ethancbanks.com http://packetpushers.net
Who is Ethan Banks? Senior Network Architect @ Carenection, CCIE #20655. Podcaster @ PacketPushers.net. Writer @ NetworkComputing.com & EthanCBanks.com. Infrastructure track chair @ Interop. ethan.banks@packetpushers.net @ecbanks
Defining Active/Active for this session. A/A is not disaster recovery, but rather disaster avoidance. A/A allows you to serve your application from two or more locations in real time or near-real time. One site might be preferred over another, but synchronization of the storage and database tiers is maintained between sites. In this session, we will consider a web application in a dual-DC A/A design where DNS is used for load balancing & failure recovery.
Reference Diagram DC Green DC Blue INTERNET Customers
DNS TTL is sometimes ignored. One strategy for switching an app to a different data center is low TTLs. PROBLEM: Not all clients will honor the TTL. RESULT: Some clients remain stuck on the old data center. MITIGATION: Understand client behavior; recommend best practices to customers or developers.
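The stuck-client behavior is easy to reason about with a toy model. This Python sketch (names and addresses are illustrative, drawn from the RFC 5737 documentation ranges) contrasts a resolver cache that honors TTL with one that pins records forever:

```python
import time

class ResolverCache:
    """Toy stub-resolver cache. honor_ttl=False models clients that pin records."""
    def __init__(self, honor_ttl=True):
        self.honor_ttl = honor_ttl
        self._cache = {}  # name -> (address, expiry_timestamp)

    def resolve(self, name, lookup, now=None):
        now = time.time() if now is None else now
        entry = self._cache.get(name)
        if entry and (not self.honor_ttl or now < entry[1]):
            return entry[0]            # answer served from cache
        address, ttl = lookup(name)    # fresh authoritative answer
        self._cache[name] = (address, now + ttl)
        return address

# Hypothetical failover: the record moves from DC Green to DC Blue, TTL 30s.
state = {"failed_over": False}
def lookup(name):
    return ("198.51.100.1", 30) if state["failed_over"] else ("192.0.2.1", 30)

good, bad = ResolverCache(honor_ttl=True), ResolverCache(honor_ttl=False)
good.resolve("app.example.com", lookup, now=0)
bad.resolve("app.example.com", lookup, now=0)

state["failed_over"] = True        # DC Green fails; DNS now points at DC Blue
follows = good.resolve("app.example.com", lookup, now=60)  # TTL expired: re-resolves
stuck = bad.resolve("app.example.com", lookup, now=60)     # ignores TTL: old DC
print(follows, stuck)  # 198.51.100.1 192.0.2.1
```

A compliant client follows the move once its 30-second TTL lapses; the non-compliant one keeps hitting the dead data center, which is exactly the population the mitigation has to account for.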
BGP announcements take time. BGP at the Internet edge is used to announce public address space to the Internet. PROBLEM: During a failover event, BGP convergence can take several minutes to complete. Note that carriers filter your announcements; you can't announce new prefixes on the fly. RESULT: Some clients head to the wrong location during convergence. MITIGATION: Announce all routes from all locations, but use AS-path prepending or other metrics to influence inbound traffic. Work with your carrier. You end up with faster convergence.
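The prepending mitigation comes down to one step of BGP best-path selection: all else being equal, the shortest AS path wins. A minimal Python sketch, with private-range ASNs 65001/65002 and next-hop labels as illustrative stand-ins:

```python
def best_path(routes):
    """One step of BGP best-path selection: prefer the shortest AS path."""
    return min(routes, key=lambda r: len(r["as_path"]))

# Announce the same prefix from both DCs; DC Blue prepends its own ASN twice
# so DC Green is preferred, yet the backup route is already in the table.
routes = [
    {"next_hop": "dc-green", "as_path": [65001]},
    {"next_hop": "dc-blue",  "as_path": [65002, 65002, 65002]},
]
preferred = best_path(routes)["next_hop"]
# If DC Green's announcement is withdrawn, the prepended path takes over:
after_failure = best_path([r for r in routes if r["next_hop"] != "dc-green"])["next_hop"]
print(preferred, after_failure)  # dc-green dc-blue
```

Because the backup path is continuously announced, failover is a withdrawal plus local best-path recomputation rather than a fresh announcement propagating through carrier filters.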
Firewalls need symmetric traffic flows. Stateful firewalls are used at the Internet edge and between DMZ/trusted networks. PROBLEM: Asymmetric traffic breaks stateful inspection, causing flow termination. RESULT: Broken client connectivity. Sessions drop. MITIGATION: Enforce a symmetric path. Use proxies or NAT (!). Alternately, mirror firewall state tables between data centers (hard, unreliable).
Dual DC firewalls need identical policies. In an active/active design, the assumption is that firewall clusters at each DC maintain an identical security policy. PROBLEM: Policy sharing is not automatic with firewall vendors; it must be explicitly configured. RESULT: Traffic flows that are permitted in one DC could be denied in the other, and vice-versa. MITIGATION: Replicate policies on same-tier firewalls.
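Auditing for policy drift is essentially a set difference over the two rule bases. A hedged Python sketch; the rule tuples are an illustrative abstraction of whatever your vendor's policy export provides:

```python
def policy_drift(policy_a, policy_b):
    """Rules present in one DC's firewall policy but missing from the other."""
    a, b = set(policy_a), set(policy_b)
    return {"only_in_a": a - b, "only_in_b": b - a}

# Rules modeled as (action, protocol, source, destination, port) tuples.
green = {("permit", "tcp", "any", "10.1.0.10", 443),
         ("permit", "tcp", "any", "10.1.0.10", 80)}
blue = {("permit", "tcp", "any", "10.1.0.10", 443)}

drift = policy_drift(green, blue)
print(drift["only_in_a"])  # {('permit', 'tcp', 'any', '10.1.0.10', 80)}
```

Anything in either drift bucket is a flow that will behave differently after a failover; an empty result in both directions is the invariant a dual-DC audit should enforce.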
ADCs need smart checks to fail over. Application delivery controllers (ADCs) use health checks to determine the availability of an application. PROBLEM: Insufficient checks mean that ADCs can't tell when an application is no longer available. RESULT: The ADC does not swing traffic to working pool members, resulting in failed client sessions. MITIGATION: Write multi-level L7 health checks that verify an application is completely available. Think authentication, database access, web server, SSL, etc.
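The "multi-level" idea can be sketched as an all-tiers-must-pass aggregate. The probes below are hypothetical placeholders; a real check would log in, run a reference database query, and fetch a known page over SSL rather than just ping a port:

```python
def app_available(checks):
    """The app is 'up' only if every tier passes; returns (ok, failed_tiers)."""
    failed = [name for name, probe in checks if not probe()]
    return (len(failed) == 0, failed)

# Stand-in per-tier probes (real ones would exercise auth, DB, web, SSL).
checks = [
    ("web",  lambda: True),
    ("auth", lambda: True),
    ("db",   lambda: False),   # reference query against the database fails
]
ok, failed = app_available(checks)
print(ok, failed)  # False ['db']
```

A port-80 ping alone would have reported this application as healthy; the layered check marks the pool member down so the ADC can swing traffic.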
ADCs need symmetric traffic flows. ADCs from A10 and F5 are full TCP proxies, and require that traffic flowing through them is symmetrical. PROBLEM: When failing over to nodes in the opposite data center, return traffic will stay in the opposite DC. RESULT: Broken client sessions during a partial failover event. MITIGATION: NAT to the ADC's address, tunnel to the opposite data center, or static-route traffic back to the origin data center.
Latency is a limitation. Acknowledged traffic is limited in throughput by latency. It's math. http://bradhedlund.com/2008/12/19/how-to-calculate-tcp-throughput-for-long-distance-links/ PROBLEM: Waiting around for "I got it" slows down the transfer. This impacts real-time transactions as well as synchronous storage replication. RESULT: Inter-DC transactions take more time than intra-DC ones. This can have a cumulative effect. MITIGATION: WAN optimization. Optimize TCP stacks. Don't architect real-time expectations into long-distance geography.
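The math referenced above is the bandwidth-delay bound for a single TCP flow: at most one receive window can be in flight per round trip, so throughput tops out at window size divided by RTT. A quick sketch with illustrative numbers:

```python
def max_tcp_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound for one TCP flow: at most one window in flight per round trip."""
    return window_bytes * 8 / rtt_seconds

# A classic 64 KB receive window over a 50 ms inter-DC round trip:
bps = max_tcp_throughput_bps(65536, 0.050)
mbps = round(bps / 1e6, 2)
print(mbps, "Mbps")  # 10.49 Mbps -- no matter how fat the link is
```

That single flow caps out around 10 Mbps even on a 10 Gbps circuit, which is why the mitigations are window scaling, TCP stack tuning, and WAN optimization rather than simply buying bandwidth.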
Synchronization can fall behind. Data centers aren't active/active if data sets (storage, database) are not synchronized between sites. PROBLEM: Insufficient bandwidth, high latency, or overly large data sets can exceed synchronization windows. DCs become unsynchronized, possibly falling further behind over time. RESULT: The alternate DC is not able to process traffic when needed. MITIGATION: More bandwidth. Lower latency. WAN optimization.
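The "falling further behind" effect is simple arithmetic: whenever the data change rate exceeds the bandwidth replication actually gets, the backlog grows linearly. A sketch with hypothetical numbers (a 1 Gbps link where QoS leaves replication roughly 600 Mbps against 700 Mbps of churn):

```python
def backlog_growth_gb_per_hour(change_rate_mbps, usable_link_mbps):
    """How far replication falls behind per hour when churn outruns bandwidth."""
    deficit_mbps = max(0, change_rate_mbps - usable_link_mbps)
    return deficit_mbps / 8 * 3600 / 1000   # Mbit/s -> MByte/s -> MB/h -> GB/h

backlog = backlog_growth_gb_per_hour(700, 600)
print(backlog)  # 45.0 GB/hour of un-replicated data piling up
```

A sustained 100 Mbps deficit leaves the alternate DC 45 GB further behind every hour, which is why sizing the link for peak change rate, not average, is the real requirement.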
Broken inter-DC links can break active/active. Often, there's a data processing dependency on the inter-DC link. PROBLEM: A/A DC applications often rely on a spider web of data source interactions to do what they do. Think multiple database queries, authentication mechanisms, storage arrays, logging, etc. RESULT: If communication between DCs fails, an application dependency breaks, and the app itself can fail wholly or in part. MITIGATION: Isolate application dependencies to a single DC. Make DC-DC links redundant in all ways; plan path diversity with your carrier. Know your routing plan well. Don't forget that a tunnel over a public path is a viable backup.
Big, noisy flows squash delicate, sensitive flows. Big flows tend to fill pipes, especially with WAN optimization, and data synchronization tends to fill inter-DC pipes. PROBLEM: Latency & jitter for time-sensitive flows such as VoIP and interactive traffic increase as links approach full. Don't forget about microbursts. RESULT: End-user application experience is lousy. MITIGATION: QoS using priority (low-latency) queues. LLQs reserve bandwidth during times of congestion and dequeue traffic on a regular time interval. Note that sometimes it's okay to put non-voice traffic in an LLQ.
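The strict-priority part of LLQ behavior can be modeled with a two-class toy scheduler (real LLQs also police the priority class so voice cannot starve everything else; that part is omitted here, and the packet labels are illustrative):

```python
import heapq

def dequeue_order(arrivals):
    """Two-class toy scheduler: class 0 (the LLQ) always drains before class 1."""
    q = []
    for seq, (prio, label) in enumerate(arrivals):
        heapq.heappush(q, (prio, seq, label))  # seq preserves FIFO within a class
    return [heapq.heappop(q)[2] for _ in arrivals]

# Bulk sync packets arrive interleaved with VoIP; the LLQ pulls voice ahead.
arrivals = [(1, "sync-1"), (0, "voip-1"), (1, "sync-2"), (0, "voip-2")]
order = dequeue_order(arrivals)
print(order)  # ['voip-1', 'voip-2', 'sync-1', 'sync-2']
```

During congestion the voice packets jump the bulk synchronization traffic, bounding their queuing delay; within each class, first-in-first-out order is preserved.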
Untested failovers don't fail over. We build redundant active/active data centers as a disaster avoidance mechanism. PROBLEM: Applications are complicated and evolve over time. RESULT: A DC has a failure and all failover processes work as expected, but the application itself does not work when served from the other DC. MITIGATION: Test application failover regularly; quarterly is a good interval. Audit processes with the infrastructure team. Don't forget load testing.
Fate sharing is a dual-DC design nightmare. Part of a dual-DC design is ensuring that something bad happening in one DC doesn't happen in the other. PROBLEM: Improper L2 extension pushes issues such as bridging loops or broadcast storms from one DC to another. RESULT: Two DCs are unable to process traffic. Users lose access to applications. MITIGATION: Maintain separate L2 domains using tools like HP EVI or Cisco OTV. Alternatively, do not stretch VLANs.
To the future! HTTP/2 Cross-site state synchronization EVPN
Thanks & stay in touch! ethan.banks@packetpushers.net @ecbanks EthanCBanks.com PacketPushers.net NetworkComputing.com