Architecting a Highly Available Infrastructure: An Overview SLN187. Mark Milow Mike DiPetrillo

Agenda
- Why?
- Solutions
- Q&A

Quick Stats
- A 2004 Gartner study found an average cost of $42,000 per hour of downtime
- The average network experiences 175 hours of downtime a year (98% availability)
- That's $7.35 million in lost revenue each year
- An increase to 99% availability means only $3.68 million
- That's $3.67 million you can spend on DR
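
The Quick Stats arithmetic can be reproduced with a short sketch. The function name `annual_downtime_cost` and the 8,760-hour year are assumptions made for illustration; only the $42,000 per hour figure and the 98% and 99% availability levels come from the slide. An exact 98% works out to 175.2 hours, so the results differ slightly from the slide's rounded numbers.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year


def annual_downtime_cost(availability, cost_per_hour):
    """Return (downtime hours per year, lost revenue per year)."""
    downtime_hours = (1.0 - availability) * HOURS_PER_YEAR
    return downtime_hours, downtime_hours * cost_per_hour


COST_PER_HOUR = 42_000  # Gartner average quoted on the slide

hours_98, cost_98 = annual_downtime_cost(0.98, COST_PER_HOUR)
hours_99, cost_99 = annual_downtime_cost(0.99, COST_PER_HOUR)

# Money freed up by moving from 98% to 99% availability.
savings = cost_98 - cost_99

print(f"98%: {hours_98:.1f} h/year, ${cost_98 / 1e6:.2f}M lost")
print(f"99%: {hours_99:.1f} h/year, ${cost_99 / 1e6:.2f}M lost")
print(f"Difference: ${savings / 1e6:.2f}M")
```

Changing `COST_PER_HOUR` or the availability values lets you rerun the same comparison for your own environment.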

Quick Math: Amazon.com
- Revenue in 2001: $3.1B/year with 7774 employees
- Revenue per hour: ~$350,000
  - If an outage affects 90% of revenue: ~$320,000
- Assume the average annual salary is $85,000
  - $656M/year or $12.5M/week for all staff
  - At 50 hours/week: ~$250,000 per hour
  - If an outage affects 80% of employees: ~$200,000
- Total is $520,000 per hour of downtime
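
The same estimate can be written as a small function so the inputs are easy to swap. This is a sketch under stated assumptions: the function name `downtime_cost_per_hour`, the 8,760-hour revenue year, and the 52-week payroll year are not on the slide. The direct product of 7774 employees and $85,000 is about $661M, slightly above the slide's $656M, but the rounded hourly total still lands near $520,000.

```python
def downtime_cost_per_hour(annual_revenue, employees, avg_salary,
                           revenue_share_hit, staff_share_hit,
                           work_hours_per_week=50, weeks_per_year=52):
    """Estimate hourly outage cost as lost revenue plus idle payroll."""
    # Revenue is assumed to arrive around the clock, every day of the year.
    revenue_per_hour = annual_revenue / (365 * 24)
    lost_revenue = revenue_share_hit * revenue_per_hour

    # Payroll is spread over working hours only.
    payroll_per_hour = (employees * avg_salary) / (weeks_per_year * work_hours_per_week)
    lost_productivity = staff_share_hit * payroll_per_hour

    return lost_revenue, lost_productivity, lost_revenue + lost_productivity


# Figures from the slide: Amazon.com, 2001.
lost_revenue, lost_productivity, total = downtime_cost_per_hour(
    annual_revenue=3.1e9,
    employees=7774,
    avg_salary=85_000,
    revenue_share_hit=0.90,
    staff_share_hit=0.80,
)

print(f"Lost revenue per hour:      ${lost_revenue:,.0f}")
print(f"Lost productivity per hour: ${lost_productivity:,.0f}")
print(f"Total per hour of downtime: ${total:,.0f}")
```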

Why and for What?
- Different levels
  - Planned disasters (hurricanes, etc.)
  - Unplanned disasters (power outage, tornadoes, etc.)
- High availability
  - Plan for local (inside the datacenter) failures
- Disaster recovery
  - Plan for regional (the datacenter is gone) failures

Definitions
- MTR: Mean Time to Recover
- CTR: Cost To Recover
- Tiers: Levels of Recovery Options

Helpful Hints
- Tier data
  - Break data into tiers of different MTR commits
- Networking
  - Plan how to get your network to fail over with the data
- People
  - Don't plan on people flying to remote sites
- Automate
  - Automate as much of the failover as possible
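
As one way to picture the "tier data" hint, the sketch below assigns each workload to the loosest tier whose MTR commitment still meets the workload's requirement. Everything here is hypothetical: the tier names, the hour thresholds, the workload list, and the `assign_tier` helper are illustrative assumptions, not values from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    max_mtr_hours: float  # recovery-time commitment for this tier


# Hypothetical tiers, strictest first. Real thresholds depend on the business.
TIERS = [
    Tier("tier-1", 1),
    Tier("tier-2", 8),
    Tier("tier-3", 72),
]


def assign_tier(required_mtr_hours):
    """Pick the loosest tier that still recovers within the requirement."""
    candidates = [t for t in TIERS if t.max_mtr_hours <= required_mtr_hours]
    if not candidates:
        raise ValueError("no tier recovers fast enough for this workload")
    return max(candidates, key=lambda t: t.max_mtr_hours)


# Hypothetical workloads and the recovery time (in hours) each can tolerate.
workloads = {"email": 2, "file-print": 24, "archive": 168}
plan = {name: assign_tier(hours).name for name, hours in workloads.items()}
print(plan)
```

Choosing the loosest tier that still qualifies reflects the MTR versus CTR trade-off: faster recovery generally costs more, so a workload should not sit in a stricter tier than it needs.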

Local High Availability

Local High Availability
- Features
  - Standard clustering agents
  - Solutions for non-cluster aware apps
  - Stateful and non-stateful failover
  - Failover and fail-back
- Product vendors
  - Legato, Microsoft, Veritas, SteelEye, Linux
- Deployment scenarios
  - Physical to virtual
  - Virtual to virtual
  - Poor man's

Physical to Virtual Clustering
[Diagram: three primary servers (MS Exchange on Windows 2000, File/Print on Windows NT, and an Intranet App Server on Windows 2000), each on a 1U, 2-way rack server with its data on shared disks, arrays, or SAN storage. The matching failover servers run on a 4U, 8-way rackmount with ESX Server.]

Virtual to Virtual Clustering

Poor Man's Clustering
[Diagram: ESX Server 1 and ESX Server 2 (a Dell 4-way rackmount and an HP 4-way rackmount) attached to shared disks, arrays, or SAN storage. VM1 through VM6 appear on both hosts, marked ON on one host and OFF on the other.]

Local High Availability
- Considerations
  - Cost
  - MTR
  - Number of virtual machines
  - Disk space
- Benefits
  - Out-of-the-box solution
  - Stateful failover
  - Inexpensive
  - For any application

OS Based Solutions

OS Based Solutions
- Features
  - Regular agents
  - Very similar to physical environment
  - Efficiencies from virtual machine architecture
- Product vendors
  - Legato, Symantec, Veritas
- Deployment scenarios
  - Physical to virtual
  - Virtual to virtual
  - Trunk of Car

Physical to Virtual
[Diagram labels: Backup Server; Tape Array]

Virtual to Virtual
[Diagram labels: Backup Server; Tape Array]

Trunk of Car

OS Based Solutions
- Considerations
  - Cost
  - Number of virtual machines
  - Bandwidth
  - Disk space
- Benefits
  - Standard solution
  - No learning curve
  - Great reduction in agent cost

Host Based Solutions

Host Based Solutions
- Features
  - Agent based, runs as a service
  - File/byte level replication
  - Synchronous and asynchronous
  - Failover and fail-back
- Product vendors
  - Legato, NSI Double-Take, NeverFail, Mimix
- Deployment scenarios
  - Physical to virtual
  - Virtual to virtual
  - Multi-node
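
The difference between synchronous and asynchronous replication can be illustrated with a toy in-memory model. This is only a sketch: the `Primary` and `Replica` classes and their methods are hypothetical and do not represent how any of the listed vendor products work. It shows the general trade-off that a synchronous write reaches the replica before it is acknowledged, while an asynchronous write is queued and the replica lags until the queue is shipped.

```python
from collections import deque


class Replica:
    """Stand-in for the failover site's copy of the data."""

    def __init__(self):
        self.data = {}


class Primary:
    """Stand-in for the primary site, replicating to one replica."""

    def __init__(self, replica, synchronous):
        self.data = {}
        self.replica = replica
        self.synchronous = synchronous
        self.pending = deque()  # writes not yet shipped (asynchronous mode only)

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # The write returns only after the replica holds the value.
            self.replica.data[key] = value
        else:
            # The write returns immediately; the replica catches up later.
            self.pending.append((key, value))

    def drain(self):
        """Ship queued writes to the replica, oldest first."""
        while self.pending:
            key, value = self.pending.popleft()
            self.replica.data[key] = value


sync_replica = Replica()
sync_primary = Primary(sync_replica, synchronous=True)
sync_primary.write("order-1", "paid")

async_replica = Replica()
async_primary = Primary(async_replica, synchronous=False)
async_primary.write("order-1", "paid")

# A failover at this point would find the write on the synchronous replica
# but not yet on the asynchronous one.
lag_before_drain = len(async_primary.pending)
async_primary.drain()
```

In practice the choice interacts with the distance and bandwidth considerations for host based solutions: waiting on a remote acknowledgment adds latency to each write, whereas queuing writes risks losing whatever has not been shipped when the primary fails.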

Virtual to Virtual
[Diagram labels: Primary Site; Failover Site]

Virtual to Virtual Multi-Node
[Diagram labels: Primary Site; Failover Site]

Host Based Solutions
- Considerations
  - Cost
  - Distance
  - Number of virtual machines
  - Bandwidth
- Benefits
  - Out-of-the-box solution
  - Maximum uptime
  - Ease of use

SAN Based Solutions

SAN Based Solutions
- Features
  - SAN layered applications
  - LUN snapshot
  - Replication/mirroring
  - Block level replication
  - Synchronous and asynchronous
- Product vendors
  - EMC, HP, IBM, Network Appliance
- Deployment scenarios
  - Physical to virtual
  - Virtual to virtual
  - Hybrid

SAN: Virtual to Virtual
[Diagram labels: Host Agent for Replication and Failover; Primary Site; Failover Site]

Hybrid Mode
[Diagram labels: Backup; Host Agent for Virtual Machine and Application Failover; SAN Replication; Primary Site; Failover Site]

SAN Based Solutions
- Considerations
  - Cost
  - Distance
  - Bandwidth
  - Downtime
  - Complex configuration
- Benefits
  - High performance
  - Centralization
  - Multiple working copies

Questions