To Cluster or Not Cluster
Tom Scanlon, NEC Solutions America
June 25, 2003

Agenda
  The PDC Case Study
  Availability Defined
  The Dilemma (to cluster or not)
  Cluster Application Availability
  Background Data
  Availability Findings
  The Continuous Availability Alternative Difference
  Real World Solutions to the Dilemma

Pharmaceutical Discovery Corporation
Founded in 1991, PDC is a leader in the pharmaceutical industry. It developed and patented the proprietary Technosphere drug delivery system, which offers formulated compounds multiple routes of delivery into the human body. PDC uses the Waters Corp. Millennium32 HPLC application for its liquid chromatography/mass spectrometry (LC/MS) procedures in QA and QC. PDC needed high availability for its laboratory computing resources. Downtime was not an option.
Concerns:
  Availability
  Management
  Cost of outage
  Non-cluster-aware application

Availability Definitions

AL4 (CA): Continuous Availability (CA)
  User experience: Transparent to user
  Transaction: Transparent to user
  Data integrity: Disk maintained; memory maintained
  Performance: No impact
  System features: 100% component and functional redundancy; no transaction loss because memory state is maintained

AL3 (Cluster): High Availability (HA)
  User experience: User stays on-line
  Transaction: May need restarting
  Data integrity: Disk maintained; memory lost
  Performance: May experience degradation
  System features: Automatic fail-over transfers user session and workload to backup components; multiple systems connected to disks

AL2 (Cluster): High Availability (HA)
  User experience: User interrupted, must re-log on
  Transaction: Re-run from journal file
  Data integrity: Disk maintained; memory lost
  Performance: May experience degradation
  System features: User work transferred to backup components; multiple system access to disks

AL1: Conventional with RAID
  User experience: Work stops
  Transaction: May be lost
  Data integrity: Disk maintained; memory lost
  Performance: Work stops
  System features: RAID and log-based/journal file system for identification and recovery of incomplete in-flight transactions

AL0: Conventional
  User experience: Work stops
  Transaction: Lost
  Data integrity: Disk not ensured; memory lost
  Performance: Work stops
  System features: No redundant system components

9's Spectrum

Nines   Percent     Hours/Year
2       99%         87.60
3       99.9%       8.76
4       99.99%      0.88
5       99.999%     0.09
6       99.9999%    0.01

Source: The Standish Group

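The Hours/Year column follows from multiplying the unavailable fraction of the year by the number of hours in a year. The short sketch below reproduces that arithmetic; it assumes a 365-day year (8,760 hours), and the function name is illustrative rather than taken from the slides.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def downtime_hours_per_year(nines: int) -> float:
    """Expected yearly downtime for an availability of `nines` nines.

    Two nines means 99% available, so 1% (10**-2) of the year is downtime.
    """
    unavailability = 10.0 ** (-nines)
    return HOURS_PER_YEAR * unavailability


# Print one row per level shown on the slide (2 through 6 nines).
for n in range(2, 7):
    availability_pct = 100 * (1 - 10.0 ** (-n))
    hours = downtime_hours_per_year(n)
    print(f"{n} nines  {availability_pct:.4f}%  {hours:.2f} hours/year")
```

Rounded to two decimals, the printed hours match the slide's column.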
Causes of Downtime
Categories: Operator Error; Main Server System Bug; Main Server HW Failure; Other Server System Bug; Other Server HW Failure; Application Bug or Error; Planned; Extended Planned; Network; Database Error; Environmental Conditions; Other
Chart percentages: 6%, 10%, 8%, 5%, 4%, 35%, 3%, 1%, 7%, 10%, 4%, 7%
Source: The Standish Group

Downtime Cost
Meta Group: How Safe is the Business?

Industry Sector               Revenue/Hour   Revenue/Employee Hour
Energy                        $2,817,846     $569.20
Telecommunications            $2,066,245     $168.98
Manufacturing                 $1,610,645     $134.20
Financial Institutions        $1,495,134     $1,079.89
Information Technology        $1,344,461     $184.03
Insurance                     $1,202,444     $370.92
Retail                        $1,107,274     $244.37
Pharmaceuticals               $1,082,252     $167.53
Banking                       $996,802       $130.52
Food/Beverage Processing      $804,192       $153.10
Consumer Products             $785,719       $127.98
Chemicals                     $704,101       $194.53
Transportation                $668,586       $107.78
Utilities                     $643,250       $380.94
Healthcare                    $636,030       $142.58
Metals/Natural Resources      $580,588       $153.11
Professional Services         $532,510       $99.59
Electronics                   $477,366       $74.48
Construction & Engineering    $389,601       $216.18
Media                         $340,432       $119.74
Hospitality                   $330,654       $38.62
Average                       $1,010,536     $205.55

Source: Meta Group, 5 February 2002, File: EDCS 1060, How Safe Is the Business?

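One way to read this table alongside the 9's spectrum is to multiply a sector's Revenue/Hour by the expected hours of downtime per year. The sketch below does this for the Pharmaceuticals row. It assumes revenue is lost linearly with downtime, which the slides do not state, and it uses the sector average, not PDC's own revenue. The function and variable names are illustrative.

```python
# Meta Group sector average from the table above (USD per hour of downtime).
PHARMA_REVENUE_PER_HOUR = 1_082_252

# Hours/Year values from the 9's spectrum slide.
DOWNTIME_HOURS = {"99%": 87.60, "99.9%": 8.76, "99.99%": 0.88}


def outage_cost(revenue_per_hour: float, downtime_hours: float) -> float:
    """Estimated yearly revenue exposure, assuming a linear loss per hour."""
    return revenue_per_hour * downtime_hours


for availability, hours in DOWNTIME_HOURS.items():
    cost = outage_cost(PHARMA_REVENUE_PER_HOUR, hours)
    print(f"{availability:>7} available: about ${cost:,.0f} per year")
```

Under this assumption, each added nine cuts the estimated exposure by roughly a factor of ten, which is the kind of figure to weigh against hardware, set-up, and maintenance costs.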
When to Cluster?
  Clustering for scalability
  Clustering for availability
Considerations:
  Hardware & software costs
  Set-up
  Maintenance
  Ongoing training
  Cost of downtime

Cluster / Non-Clusters: % of Application Downtime
A study conducted by Standish Research International confirmed that applications are down more often in cluster environments than in non-cluster environments.

              App    Non-App
Cluster       39%    61%
Non-Cluster   26%    74%

Source: The Standish Group

Cluster / Non-Clusters: Hours of Yearly Downtime

           Clusters   Non-Clusters
Non-App    5.9        5.0
App        3.8        1.8

Source: The Standish Group

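The yearly hours here are consistent with the application-downtime percentages on the preceding slide: the application share is the application hours divided by total hours. A small check, with an illustrative function name:

```python
def app_share(app_hours: float, non_app_hours: float) -> float:
    """Fraction of total yearly downtime attributed to the application."""
    return app_hours / (app_hours + non_app_hours)


# (application hours, non-application hours) from the slide.
yearly_downtime = {
    "Clusters": (3.8, 5.9),
    "Non-Clusters": (1.8, 5.0),
}

for environment, (app, non_app) in yearly_downtime.items():
    total = app + non_app
    share_pct = 100 * app_share(app, non_app)
    print(f"{environment}: {total:.1f} hours total, {share_pct:.0f}% application")
```

The totals are 9.7 hours per year for clusters and 6.8 for non-clusters, and the application shares round to the 39% and 26% shown on the percentage slide.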
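Clusters and CA Servers
Diagram: Cluster Servers compared with a Continuous Availability Server, in normal operation and after failover.
Cluster Servers, normal operation: Node 1 (App A on OS A) and Node 2 (App B on OS B), linked by a heartbeat and attached to shared storage.
Cluster Servers, after failover: Node 1 has failed (X); Node 2 runs App A and App B on OS B, still attached to shared storage.
Continuous Availability Server, normal operation: App A on OS A runs across two Processor and I/O Module sets.
Continuous Availability Server, after failover: one module has failed (X); App A continues on OS A using the remaining modules.
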
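The heartbeat in the cluster half of the diagram can be illustrated with a toy timeout check: the surviving node treats its peer as failed once no heartbeat has arrived within some interval, and only then starts failover. This is a minimal sketch, not any cluster product's actual mechanism. The class name, the 5-second timeout, and the timestamps are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat received from a peer node (illustrative)."""

    timeout_s: float          # hypothetical threshold for declaring the peer failed
    last_seen_s: float = 0.0  # timestamp of the most recent heartbeat

    def beat(self, now_s: float) -> None:
        """Record a heartbeat received at time `now_s`."""
        self.last_seen_s = now_s

    def peer_failed(self, now_s: float) -> bool:
        """True once the peer has been silent for longer than the timeout."""
        return (now_s - self.last_seen_s) > self.timeout_s


monitor = HeartbeatMonitor(timeout_s=5.0)
monitor.beat(now_s=10.0)
print(monitor.peer_failed(now_s=12.0))  # 2 s of silence: still within the timeout
print(monitor.peer_failed(now_s=16.0))  # 6 s of silence: failover would begin
```

The detection delay is built into this design: a failure cannot be acted on until the timeout expires, which is one component of the wait that cluster users experience during failover.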
NEC FT: Feature / Benefits
Categories: Fault Tolerant Hardware, Software, Service

Feature                          Value
System Redundancy                Provides no single point of failure
Memory Redundancy                Ensures no loss of in-flight transaction data
Active-Active Operation          Provides system integrity to OS and apps
On-Line Spares                   Instantaneous failover (no waiting for parts / manpower)
Complete System Hot-swap         Service without system outage
Hardened Drivers                 OS & application stability & availability
W2K Advanced Server              Standard application compatibility
Universal Application Support    Supports any application
Standard OS Management           Low IT staff requirements
Modular - Low Tech Design        Service savings (repair time and maintenance)
Advanced Monitoring              Problem resolution and proactive service
Customer Replaceable             Secure system maintenance

Why NEC FT for Continuous Availability?
  Zero-downtime hardware - total hardware redundancy
  Ease of maintenance - modular design
  Runs standard applications - no SW modifications
  Lights-out computing - complete remote management
  Lower Total Cost of Ownership - beats clusters

FT Advantage: Transparent Failover
Chart: Average Failover Time in seconds (axis 0 to 120), Cluster vs. FT Server.
Instantaneous: FT failover occurs without the loss of processing resource.
No impact: users do not experience the wait from:
  Cluster failover detection
  Assignment of resource
  Script to reconstruct
There is no loss of application state or data after the failover occurs:
  No re-login
  No data re-entry
  No hung applications

FT Advantage: Simplified Service
Modular Self Service Design
  Customer Replaceable Units
  Toner cartridge approach
  Without being down
Non-Stop Operation:
  On-Line Spares
  Instantaneous Failover
Non-Stop Service
  Hot-swappable Modules
Low Tech Simplified Approach
  Real Customer Replaceable
  Manpower Cost Savings
Chart: share of FRU vs. CRU components (0% to 100%) for an SHV (Standard High Volume) server and an FT server; the FT server has 65% more CRU components.

FT Advantage: Simplified Administration
No scripting to initiate:
  Fail-over
  Restoration
No external integration:
  Heart-beat networking
  Console consolidation
Single console:
  For operating system
  For applications
Figure: Typical Cluster Service Administration

TCO Model Findings
Chart: cost by phase (Hardware, Operating System, Application, Install, Service, Administration, Outage) for Traditional, Cluster, and FT-Server configurations.

CA Solutions: Pharmaceutical
PDC chose NEC Fault Tolerant Servers to deliver continuous availability for laboratory production & management.
HPLC: High-Pressure Liquid Chromatography
Creating the Drug Delivery Systems of Tomorrow. Today
Diagram labels: External Storage (500GB Storage); Document Management; Lab Document Management; Lab & ERP Management; Lab Management