Architecting a Highly Available Infrastructure: An Overview
  SLN187
  Mark Milow, Mike DiPetrillo
Agenda
  Why?
  Solutions
  Q&A
Quick Stats
  A 2004 Gartner study found an average cost of $42,000 per hour of downtime
  The average network experiences 175 hours of downtime a year (98% availability)
  That's $7.35 million in lost revenue each year
  Raising availability to 99% cuts that to $3.68 million
  That's $3.67 million you can spend on DR
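The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the year is assumed to have 8,760 hours. Exactly 98% availability works out to 175.2 hours of downtime, which the slide rounds to 175, so the computed totals differ slightly from the rounded slide figures.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def downtime_hours(availability: float) -> float:
    """Hours of downtime per year at a given availability (0.98 means 98%)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def annual_downtime_cost(availability: float, cost_per_hour: float) -> float:
    """Yearly cost of downtime at a flat hourly cost."""
    return downtime_hours(availability) * cost_per_hour


COST_PER_HOUR = 42_000  # hourly figure quoted on the slide

cost_98 = annual_downtime_cost(0.98, COST_PER_HOUR)
cost_99 = annual_downtime_cost(0.99, COST_PER_HOUR)
savings = cost_98 - cost_99

print(f"98% availability: {downtime_hours(0.98):.1f} h/year, ${cost_98:,.0f}")
print(f"99% availability: {downtime_hours(0.99):.1f} h/year, ${cost_99:,.0f}")
print(f"Difference:       ${savings:,.0f}")
```

Swapping in your own hourly cost and availability targets gives a quick upper bound on what an availability improvement is worth.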
Quick Math: Amazon.com
  Revenue in 2001: $3.1B/year with 7,774 employees
  Revenue per hour: ~$350,000
    If an outage affects 90% of revenue: ~$320,000
  Assume the average annual salary is $85,000
    $656M/year or $12.5M/week for all staff
    At 50 hours/week: ~$250,000 per hour
    If an outage affects 80% of employees: ~$200,000
  Total is $520,000 per hour of downtime
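The same estimate can be reproduced as a short Python sketch. The function and parameter names are hypothetical, and the code assumes revenue is spread evenly over 8,760 hours and payroll over 52 weeks. Because the slide rounds its intermediate values (unrounded inputs give an annual payroll of about $661M rather than $656M), the result lands near, not exactly on, $520,000.

```python
HOURS_PER_YEAR = 365 * 24  # revenue assumed to accrue evenly around the clock


def revenue_loss_per_hour(annual_revenue: float, fraction_affected: float) -> float:
    """Revenue lost per hour of outage."""
    return annual_revenue / HOURS_PER_YEAR * fraction_affected


def payroll_loss_per_hour(employees: int, avg_salary: float,
                          hours_per_week: float, fraction_affected: float,
                          weeks_per_year: int = 52) -> float:
    """Salary paid per hour to staff who cannot work during the outage."""
    weekly_payroll = employees * avg_salary / weeks_per_year
    return weekly_payroll / hours_per_week * fraction_affected


# Inputs taken from the slide
revenue = revenue_loss_per_hour(3.1e9, 0.90)
payroll = payroll_loss_per_hour(7774, 85_000, 50, 0.80)
total = revenue + payroll

print(f"Revenue loss per hour: ${revenue:,.0f}")
print(f"Payroll loss per hour: ${payroll:,.0f}")
print(f"Total per hour:        ${total:,.0f}")
```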
Why and for What?
  Different levels
    Planned disasters (hurricanes, etc.)
    Unplanned disasters (power outage, tornadoes, etc.)
  High availability
    Plan for local (inside the datacenter) failures
  Disaster recovery
    Plan for regional (the datacenter is gone) failures
Definitions
  MTR: Mean Time to Recover
  CTR: Cost to Recover
  Tiers: Levels of recovery options
Helpful Hints
  Tier data
    Break data into tiers with different MTR commitments
  Networking
    Plan how to get your network to fail over along with the data
  People
    Don't plan on people flying to remote sites
  Automate
    Automate as much of the failover as possible
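One way to picture data tiering is as a lookup from a workload's recovery requirement to the tier that still meets it. The sketch below is purely illustrative: the tier names, the MTR values, and the assumption that slower tiers are cheaper are all hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    mtr_hours: float  # committed mean time to recover for this tier


# Hypothetical tiers; real commitments depend on the business
TIERS = [Tier("tier-1", 1), Tier("tier-2", 8), Tier("tier-3", 72)]


def assign_tier(required_mtr_hours: float, tiers=TIERS) -> Tier:
    """Pick the slowest tier (assumed cheapest) that still meets the requirement."""
    eligible = [t for t in tiers if t.mtr_hours <= required_mtr_hours]
    if not eligible:
        raise ValueError(f"no tier recovers within {required_mtr_hours} hours")
    return max(eligible, key=lambda t: t.mtr_hours)


# Hypothetical workloads and the recovery time each one can tolerate
workloads = {"order database": 4, "intranet wiki": 24, "archive share": 100}
for workload, hours in workloads.items():
    print(f"{workload}: {assign_tier(hours).name}")
```

Choosing the slowest acceptable tier keeps the expensive, fast-recovery tier reserved for data that actually needs it.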
Local High Availability
Local High Availability
  Features
    Standard clustering agents
    Solutions for non-cluster-aware apps
    Stateful and non-stateful failover
    Failover and fail-back
  Product vendors
    Legato, Microsoft, Veritas, SteelEye, Linux
  Deployment scenarios
    Physical to virtual
    Virtual to virtual
    Poor man's
Physical to Virtual Clustering
  [Diagram: three primary servers (MS Exchange on Windows 2000, File/Print on Windows NT, Intranet App Server on Windows 2000), each a 1U, 2-way rack server with its own data, connected through shared disks, arrays or SAN storage to matching failover servers running as virtual machines on a 4U, 8-way rackmount with ESX Server]
Virtual to Virtual Clustering
Poor Man's Clustering
  [Diagram: ESX Server 1 and ESX Server 2 (Dell and HP 4-way rackmount servers) attached to shared disks, arrays, or SAN storage. VM1 through VM6 appear on both servers, each marked ON on one server and OFF on the other]
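The idea in this diagram can be modeled as a small simulation: every virtual machine is powered on at one host and kept powered off at the other, and when a host fails its running machines are powered on at the survivor. The sketch below is a toy model under stated assumptions. The host names, the initial placement, and the `fail_over` function are hypothetical, and no real ESX Server API is used.

```python
def fail_over(hosts: dict, failed: str) -> list:
    """Power on, at a surviving host, every VM that was running on the failed host.

    hosts maps host name -> {vm name: "ON" or "OFF"}.
    Returns the VMs that were moved.
    """
    survivors = [h for h in hosts if h != failed]
    lost = [vm for vm, state in hosts[failed].items() if state == "ON"]
    for vm in lost:
        hosts[failed][vm] = "OFF"
        # The standby copy is assumed to sit on shared storage at a survivor
        target = next((h for h in survivors if vm in hosts[h]), None)
        if target is None:
            raise RuntimeError(f"no standby copy of {vm} on any surviving host")
        hosts[target][vm] = "ON"
    return lost


# Assumed starting layout: each host runs three VMs and holds the other three powered off
hosts = {
    "esx1": {"VM1": "ON", "VM2": "ON", "VM3": "ON",
             "VM4": "OFF", "VM5": "OFF", "VM6": "OFF"},
    "esx2": {"VM1": "OFF", "VM2": "OFF", "VM3": "OFF",
             "VM4": "ON", "VM5": "ON", "VM6": "ON"},
}

moved = fail_over(hosts, "esx1")
print("Moved:", moved)
print("esx2 now runs:", [vm for vm, s in hosts["esx2"].items() if s == "ON"])
```

A model like this also makes the sizing question visible: the surviving host must have enough capacity to run all six machines at once.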
Local High Availability
  Considerations
    Cost
    MTR
    Number of virtual machines
    Disk space
  Benefits
    Out-of-the-box solution
    Stateful failover
    Inexpensive
    For any application
OS Based Solutions
OS Based Solutions
  Features
    Regular agents
    Very similar to physical environment
    Efficiencies from virtual machine architecture
  Product vendors
    Legato, Symantec, Veritas
  Deployment scenarios
    Physical to virtual
    Virtual to virtual
    Trunk of car
Physical to Virtual
  [Diagram: backup server, tape, array]
Virtual to Virtual
  [Diagram: backup server, tape, array]
Trunk of Car
OS Based Solutions
  Considerations
    Cost
    Number of virtual machines
    Bandwidth
    Disk space
  Benefits
    Standard solution
    No learning curve
    Great reduction in agent cost
Host Based Solutions
Host Based Solutions
  Features
    Agent based, runs as a service
    File/byte level replication
    Synchronous and asynchronous
    Failover and fail-back
  Product vendors
    Legato, NSI Double-Take, NeverFail, Mimix
  Deployment scenarios
    Physical to virtual
    Virtual to virtual
    Multi-node
Virtual to Virtual
  [Diagram: primary site and failover site]
Virtual to Virtual Multi-Node
  [Diagram: primary site and failover site]
Host Based Solutions
  Considerations
    Cost
    Distance
    Number of virtual machines
    Bandwidth
  Benefits
    Out-of-the-box solution
    Maximum uptime
    Ease of use
SAN Based Solutions
SAN Based Solutions
  Features
    SAN layered applications
    LUN snapshot
    Replication/mirroring
    Block level replication
    Synchronous and asynchronous
  Product vendors
    EMC, HP, IBM, Network Appliance
  Deployment scenarios
    Physical to virtual
    Virtual to virtual
    Hybrid
SAN: Virtual to Virtual
  [Diagram: host agent for replication and failover between primary site and failover site]
Hybrid Mode
  [Diagram: backup; host agent for virtual machine and application failover; SAN replication between primary site and failover site]
SAN Based Solutions
  Considerations
    Cost
    Distance
    Bandwidth
    Downtime
    Complex configuration
  Benefits
    High performance
    Centralization
    Multiple working copies
Questions