A Practical Guide to Avoiding Disasters in Mission-Critical Facilities
Todd Bermont

What is a Disaster?
  An event that can unexpectedly impact the continuity of your business
  Anything that injures or has the potential to injure your employees, data, the environment, or your facility itself
  Accidents, HAZMAT spills, fires, floods, tornadoes, hurricanes, terrorism, earthquakes, utility outages, human error, equipment failures, and virtually any other event that may injure people, data, the environment, or property

Associated Business Issues
  Significant Costs Associated with Downtime
    META/Gartner Group: $330,000/Hour
    Strategic Research Group, For a Brokerage Firm: >$6.5 Million/Minute
  Continued Pressure on the Infrastructure
    Fragile Power Grid, Nature, Terrorist Threats, Hackers, Viruses
  Major Changes to Equipment in Data Centers
    Blade Servers = High Performance, High Density, Heat Generating Equipment
  Now Actual $$$ LIABILITY!
    Government Regulations (HIPAA, Sarbanes-Oxley, SAS 70, SEC & FDA)
    Service Level Agreements that Mandate Uptime
  The Bottom Line is: Downtime is Unacceptable!
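To make the downtime cost figures above concrete, the following is a minimal sketch that converts an hourly downtime rate into the cost of a single outage. The $330,000/hour rate is the META/Gartner figure cited above; the function name and the 90-minute outage are illustrative assumptions, and a flat hourly rate is itself a simplification.

```python
def downtime_cost(outage_minutes: float, cost_per_hour: float) -> float:
    """Estimate the cost of an outage, assuming a flat hourly downtime rate."""
    if outage_minutes < 0 or cost_per_hour < 0:
        raise ValueError("inputs must be non-negative")
    return outage_minutes / 60.0 * cost_per_hour


# Rate cited on the slide (META/Gartner Group), in dollars per hour.
META_GARTNER_RATE = 330_000.0

# Hypothetical 90-minute outage, for illustration only.
cost = downtime_cost(outage_minutes=90, cost_per_hour=META_GARTNER_RATE)
print(f"${cost:,.0f}")  # prints "$495,000"
```

Substituting your own hourly rate gives a first-order number to weigh against the cost of avoidance measures.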
Risk Exposure in the Data Center
  86% of Data Center Downtime is Due to Infrastructure Failure & Human Error! - Gartner
  60% of all declared disasters due to power or hardware failure! - DR Firm
  Yet until recently, most firms have spent a disproportionate amount of their IT budget on disaster recovery instead of disaster avoidance
  [Chart: Infrastructure Failure 71%, Human Error 15%, Environmental Factors 14%]

Infrastructure Risk
  Over 95% of all infrastructure failures occur between the UPS and the load! - Uptime Institute
  [Diagram labels: DR (Disaster Recovery Plan); UPS, Generators, HVAC; IT (Equipment)]
  The Key is To Focus on Mitigating RISK in this Gap!

Four Categories of Failures Leading to Disaster
  Design Failures
  Catastrophic Failures
  Compounding Failures
  Human-error Failures
Preventing Design Failures
  Develop a comprehensive design intent
  Select appropriate design firms that have experience in your specific application
  Be an active member of the design process
  Review, check and recheck; consider using a peer review
  [Image: de Havilland Comet 1A]

Preventing Catastrophic Failures
  Comprehensive maintenance program
  Predictive analysis
  Implement a Lessons Learned program
  If it can break, it will. Plan for it!
  [Image: Solar storm causes transformer failure, dropping the Quebec power grid, and causing power problems throughout the U.S.]

Preventing Compounding Failures
  Sweat the small stuff!
  Test mission-critical infrastructure as an integrated system
  Proactively maintain your equipment
  [Image: It was just a small leak]
Preventing Human-Error Failures
  Training
  Use switch-level, detailed Methods of Procedure (MOPs) & verify accuracy
  Use a pilot / co-pilot approach during switching operations
  USE THE MOP!
  Most of us are NOT Einstein

Data Center State = Ability to Succeed OR

Disaster Avoidance Considerations
  Physical Assessment
    ID Vulnerabilities
    Catalog Equipment
    Quantify Capacity
    ID Procedural Risks
    Conduct Annually
  Design
    Redundancy
    Maintainability
    Scalability
    Safety
    59% of Companies say New Equipment is Purchased w/o Regard for Power & Cooling***
  Testing & Commissioning
    Integrated Systems Testing
    Detailed MOPs & SOPs
    Match Build to Design Intent
  Maintenance & Monitoring
    52% of Companies had Operations Interrupted due to Hardware Failure*
    A Typical Large Data Center Requires Hundreds of Maintenance Activities**
    Early Detection = Minimal Disruption

  * Dulles Conference on Emergency Response Planning
  ** Lee Technologies Maintained Facilities
  *** Joint InterUnity Group AFCOM Study, April 2005
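The "Quantify Capacity" item and the power-and-cooling statistic above suggest a simple check before new equipment is purchased. The sketch below is one hedged way to do it: the function name, the 80% planning margin, and all sample numbers are hypothetical, and it assumes essentially all IT power ends up as heat the cooling plant must remove.

```python
KW_PER_TON = 3.517  # 1 ton of refrigeration = 12,000 BTU/h, about 3.517 kW


def headroom_after_add(ups_capacity_kw, cooling_capacity_tons,
                       current_load_kw, new_load_kw, max_utilization=0.8):
    """Return remaining power and cooling headroom (kW) after adding a load.

    max_utilization is an assumed planning margin, not a standard.
    A negative value means that resource would be over its planning limit.
    """
    total_load_kw = current_load_kw + new_load_kw
    power_limit_kw = ups_capacity_kw * max_utilization
    cooling_limit_kw = cooling_capacity_tons * KW_PER_TON * max_utilization
    return {
        "power_kw": power_limit_kw - total_load_kw,
        "cooling_kw": cooling_limit_kw - total_load_kw,
    }


# Hypothetical site: 500 kW UPS, 120 tons of cooling, 300 kW running today,
# and a proposed 60 kW of new blade servers.
result = headroom_after_add(500, 120, current_load_kw=300, new_load_kw=60)
for resource, kw in result.items():
    status = "OK" if kw >= 0 else "SHORT"
    print(f"{resource}: {kw:+.1f} kW ({status})")
```

In this made-up case the UPS has room but the cooling plant does not, which is exactly the kind of gap a purchase made without regard for power and cooling would miss.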
Physical Assessment
  Objective
  Comprehensive
  Documented

Design
  Design with the End in Mind
  Outline Your Goals & Objectives
  Quantify Your Cost of Downtime
  Match Resiliency with Impact to the Business
  "My HVAC Needs to be on the Generator too?"

Testing & Commissioning
  Perform realistic testing, even though it can take time & $
  Utilize a systematic process of verifying and documenting the performance of the facility's equipment
  Use someone impartial who did not design or engineer your facility
  Vendor Start-up is not Commissioning!
Maintenance & Monitoring
  Conduct preventive maintenance (PM) as recommended by vendor
  Develop a comprehensive template & perform a daily walk-through
  Monitor devices most critical and most likely to fail

Proactive Disaster Avoidance
  Ongoing Training
    MOPs & SOPs
    Warning Signs
    Predictive Maintenance
    Escalation & DR
    Safety
  Internal Controls & Safety
    Detailed MOPs & SOPs
    Documented Maintenance Tickets
    Daily Walk-thru
    Detailed Maintenance Schedule
    Weekly Prioritization Meetings
    Safety
  Documentation
    As Built Drawings
    Facility One-Line
    MOPs & SOPs
    Maintenance Tickets
    Training Manuals
    Daily Logs
    Assessment Reports
    Accurate Inventory
  Proactive Maintenance
    PMs as Recommended
    Regular Maintenance of Filters, Fuel, Coolant, etc. (Just Like Your Car)
    Escalation Procedures for Surprise Issues
    7x24x365 monitoring

Ongoing Training
  Understand how your equipment functions
  Train operators and supervisors
  Training should include:
    Modules for equipment
    Modules for procedures
      Operations Procedures
      Maintenance Procedures
      Disaster Recovery/Emergency Response Procedures
      Safety Procedures
  Don't lose knowledge at your site. Capture it!
Internal Controls & Safety
  Supervise maintenance activities
  Have someone present who has the proper knowledge and skill set for the equipment being maintained
  Document revisions when there are changes to the scope of work and procedures
  Be consistent
  There is no such thing as an unsafe, reliable data center. Make sure all safety standards are followed in both operations and maintenance

Document, Document, Document:
  Include record keeping requirements in service contracts
  Documentation generated by the service contractor provides building operations staff and management with critical information for comparing past and current conditions of equipment and system performance
  All work should be documented in an organized fashion:
    Completed Methods of Procedure (MOPs)
    Defective Items & Corrections Made
    Parts Used
    Before and after data

Document, Document, Document Continued:
  Keep documentation in soft copies & back up your data!
  Document all Preventive Maintenance (PM) activities:
    Will help locate recurring problems
    Provide an understanding of when equipment performance is degrading
    Ensure that the contractor is performing to scope of work
    Increase total system reliability
  Document all Lessons Learned
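The payoff of documented PM activity described above (locating recurring problems and spotting degrading performance) can be sketched in a few lines once the records are in soft copy. Everything here is an illustrative assumption: the record fields, the equipment tags, the 10% decline threshold, and the sample log entries are made up, and "higher reading is healthier" will not hold for every measurement.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class PMRecord:
    equipment: str                # hypothetical equipment tag
    date: str                     # ISO date (YYYY-MM-DD) of the PM visit
    reading: float                # value measured during PM; higher assumed healthier
    defect: Optional[str] = None  # defect noted by the technician, if any


def recurring_defects(records, min_count=2):
    """Return {(equipment, defect): count} for defects logged at least min_count times."""
    counts = Counter((r.equipment, r.defect) for r in records if r.defect)
    return {key: n for key, n in counts.items() if n >= min_count}


def degrading_equipment(records, max_decline=0.10):
    """Return equipment whose latest reading fell more than max_decline below its first."""
    by_equipment = defaultdict(list)
    for r in records:
        by_equipment[r.equipment].append(r)
    flagged = []
    for name, recs in by_equipment.items():
        recs.sort(key=lambda r: r.date)  # ISO dates sort chronologically as strings
        first, last = recs[0].reading, recs[-1].reading
        if first > 0 and (first - last) / first > max_decline:
            flagged.append(name)
    return sorted(flagged)


# Made-up log entries for illustration only.
log = [
    PMRecord("UPS-A", "2005-01-10", 100.0),
    PMRecord("UPS-A", "2005-04-11", 94.0, "loose terminal"),
    PMRecord("UPS-A", "2005-07-12", 85.0, "loose terminal"),
    PMRecord("GEN-1", "2005-01-15", 100.0),
    PMRecord("GEN-1", "2005-07-15", 98.0, "coolant low"),
]

print(recurring_defects(log))    # {('UPS-A', 'loose terminal'): 2}
print(degrading_equipment(log))  # ['UPS-A']
```

Comparing only the first and last readings is deliberately crude; the point is that none of this is possible unless before-and-after data is captured at every PM visit.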
Proactive Maintenance
  Fix it before it breaks
  Utilize predictive maintenance & monitoring
  Understand how your system operates and know where the weak points are
  Use your data and past experiences
  Correct weak items before they fail
  Modify procedures & scope of work to address such items
  Adjust your data gathering and collecting as necessary

What next? How can you avoid potential disasters moving forward?
  Conduct an objective physical assessment of your mission-critical facilities
  Identify the most critical vulnerabilities in equipment & operations
  Prioritize most critical issues
  Develop a plan to address those issues (Training, Operations, Expansion, Maintenance, & Disaster Recovery)
  Implement your plan

Thank You for Attending!
For More Information, Please Contact:
  Todd L. Bermont
  Email: tbermont@leetechnologies.com
  Phone: (847) 680-8809