Service Recovery & Availability. Robert Dickerson June 2010

Similar documents
Transform Availability

Maximum Availability Architecture: Overview. An Oracle White Paper July 2002

10 Reasons Why Your DR Plan Won t Work

Rejuvenating BCM - Infrastructure. Business Continuity Awareness Week March 2009

Mike Hughes Allstate Oracle Tech Lead, Oracle Performance DBA

Data Sheet: High Availability Veritas Cluster Server from Symantec Reduce Application Downtime

New England Data Camp v2.0 It is all about the data! Caregroup Healthcare System. Ayad Shammout Lead Technical DBA

Module 4 STORAGE NETWORK BACKUP & RECOVERY

TSC Business Continuity & Disaster Recovery Session

Protecting Mission-Critical Application Environments The Top 5 Challenges and Solutions for Backup and Recovery

Veritas Cluster Server from Symantec

Maximum Availability Architecture. Oracle Best Practices For High Availability

Virtualizing Oracle on VMware

Roadmap to Availability

The Right Choice for DR: Data Guard, Stretch Clusters, or Remote Mirroring. Ashish Ray Group Product Manager Oracle Corporation

OL Connect Backup licenses

Solution Brief. IBM eserver BladeCenter & VERITAS Solutions for Microsoft Exchange

INTRODUCING VERITAS BACKUP EXEC SUITE

Eliminating Downtime When Migrating or Upgrading to Oracle 10g

Minimal downtime migration

Security+ Guide to Network Security Fundamentals, Third Edition. Chapter 13 Business Continuity

Eliminate Idle Redundancy with Oracle Active Data Guard

Three Steps Toward Zero Downtime. Guide. Solution Guide Server.

Are You Protected. Get Ahead of the Curve

vsan Disaster Recovery November 19, 2017

Copyright 2012 EMC Corporation. All rights reserved.

Certified Information Systems Auditor (CISA)

SQL Server Virtualization 201

Dell Fluid Data solutions. Powerful self-optimized enterprise storage. Dell Compellent Storage Center: Designed for business results

Are you protected? Get ahead of the curve Global data protection index

Everything You Need To Know About Oracle & Storage Foundation HA

Security and Reliability of the SoundBite Platform Andy Gilbert, VP of Operations Ed Gardner, Information Security Officer

BUSINESS CONTINUITY: THE PROFIT SCENARIO

DH2i s DxEnterprise for Linux Server A SQL Server Use Case

InterSystems High Availability Solutions

Trends in Data Protection CDP and VTL

Copyright 2012 EMC Corporation. All rights reserved. EMC And VMware. <Date>

Document Sub Title. Yotpo. Technical Overview 07/18/ Yotpo

Deploying High Availability and Business Resilient R12 Applications over the Cloud

Unified Management for Virtual Storage

Wasser drauf, umrühren, fertig?

Oracle E-Business Availability Options. Solution Series for Oracle: 2 of 5

Hyper-Converged Infrastructure: Providing New Opportunities for Improved Availability

Architecting a Highly Available Infrastructure: An Overview SLN187. Mark Milow Mike DiPetrillo

Edge for All Business

TABLE OF CONTENTS. 2 INTRODUCTION. 4 THE CHALLENGES IN PROTECTING CRITICAL SAP APPLICATIONS. 6 SYMANTEC S SOLUTION FOR ENSURING SAP AVAILABILITY.

HIGH AVAILABILITY AND DISASTER RECOVERY FOR IMDG VLADIMIR KOMAROV, MIKHAIL GORELOV SBERBANK OF RUSSIA

Index. Peter A. Carter 2016 P.A. Carter, SQL Server AlwaysOn Revealed, DOI /

Microsoft SQL Server on Stratus ftserver Systems

OPTIMIZING YOUR ORACLE DATABASE ENVIRONMENTS

Industry-leading solutions for transforming data centers into drivers of business value and innovation. Symantec in the Data Center

Enterprise Manager: Scalable Oracle Management

NEC ExpressCluster Introduction.

EMC GLOBAL DATA PROTECTION INDEX KEY FINDINGS & RESULTS FOR HONG KONG

Oracle Real Application Clusters One Node

Tieto s itap Offering

EMC VPLEX WITH SUSE HIGH AVAILABILITY EXTENSION BEST PRACTICES PLANNING

DELIVERING PERFORMANCE, SCALABILITY, AND AVAILABILITY ON THE SERVICENOW NONSTOP CLOUD

SM B10: Rethink Disaster Recovery: Replication and Backup Are Not Enough

How to license Oracle Database programs in DR environments

CA ARCserve Backup. Benefits. Overview. The CA Advantage

Azure Webinar. Resilient Solutions March Sander van den Hoven Principal Technical Evangelist Microsoft

arcserve r16.5 Hybrid data protection

Veritas Storage Foundation from Symantec

How Microsoft Built MySQL, PostgreSQL and MariaDB for the Cloud. Santa Clara, California April 23th 25th, 2018

BC/DR Strategy with VMware

MySQL High Availability. Michael Messina Senior Managing Consultant, Rolta-AdvizeX /

Ensuring Business Resilience Jim Neumann, Vice President of Marketing, Power Analytics Corp.

Stretched Clusters & VMware Site Recovery Manager December 15, 2017

NEC Express5800 R320f Fault Tolerant Servers & NEC ExpressCluster Software

High Availability Infrastructure for Cloud Computing

EMC GLOBAL DATA PROTECTION INDEX KEY FINDINGS & RESULTS FOR BRAZIL

EMC GLOBAL DATA PROTECTION INDEX KEY FINDINGS & RESULTS FOR AMERICAS

PROTECT YOUR DATA, SAFEGUARD YOUR BUSINESS

Achieving Rapid Data Recovery for IBM AIX Environments An Executive Overview of EchoStream for AIX

EMC GLOBAL DATA PROTECTION INDEX KEY FINDINGS & RESULTS FOR APJ

Advanced Architectures for Oracle Database on Amazon EC2

Disaster Recovery and Mitigation: Is your business prepared when disaster hits?

IBM Case Manager on Cloud

Microsoft E xchange 2010 on VMware

How To Make Databases on Linux on System z Highly Available

Oracle Maximum Availability Architecture for Oracle Cloud

Breakthrough Data Recovery for IBM AIX Environments

VMware vsphere 4. The Best Platform for Building Cloud Infrastructures

EMC GLOBAL DATA PROTECTION INDEX KEY FINDINGS & RESULTS FOR AUSTRALIA

Networking Strategy and Optimization Services (NSOS) 2010 IBM Corporation

Transforming your IT infrastructure Journey to the Cloud Mike Sladin

for Backup & Recovery & Failover

White paper. The three levels of high availability Balancing priorities and cost

Copyright 2012, Oracle and/or its affiliates. All rights reserved. Insert Information Protection Policy Classification from Slide 13

1) ADDITIONAL INFORMATION

GIS Application Delivery Challenges & Solutions During State of Emergencies

Symantec Business Continuity Solutions for Operational Risk Management

Network Performance, Security and Reliability Assessment

Oracle WebLogic Server 12c: Seamless Oracle Database Integration

The Future of Business Continuity & Resiliency

How Does Failover Affect Your SLA? How Does Failover Affect Your SLA?

Veritas Storage Foundation for Windows by Symantec

How to solve your backup problems with HP StoreOnce

VMware, Cisco and EMC The VCE Alliance

Transcription:

Service Recovery & Availability Robert Dickerson June 2010

Started in 1971 with $3,000, 40 clients and 1 employee. 2009: over $2B revenue, 500,000+ clients, 13,000 employees. Payroll / Tax Services / 401(k) / Employee Benefits / HR Admin / Time & Labor / Employee Leasing / 1-50 Employees / 50+ Employees Disaster Recovery & Business Continuity Paychex Assets The technical solutions in place to recover IT Computer processing functions and restore service to internal and external clients. The ability to restore critical business functions and protect the corporate assets. 13,000 employees across 100 locations, 9 within Monroe County, 2 in Germany. Data centers running critical applications to support the movement of $1B per day in support of 500,000 businesses. 1,600 windows servers. 2 megawatts of power. 1,400 unix servers. 1,000 terabytes storage. 147tb backed up weekly. 75,000 inbound / 65,000 outbound phone calls each day. Technical infrastructure to support branch offices, data centers and connections to banking and trading partners.

Definitions High Availability (HA): Minimized Unplanned Downtime Unplanned = Hardware failures, network outages, data corruptions, data center failures, etc. Continuous Operations (CO): Minimize Planned Downtime Planned = Patching, application upgrades, code releases, data center maintenance, etc. High Availability + Continuous Operations = Continuous Availability

Service Availability Percentages Best In Class Platform Availability (per Gartner) High Availability approx. 5 hours of unplanned downtime per year or 99.95% Continuous Operations 12 Hours of planned downtime per Year or 99.86% Continuous Availability 17 Hours of downtime per Year or 99.81% Availability Percentage Downtime per Year Downtime per Month 90% 876 Hours 72 Hours 99% 87.6 Hours 7.20 Hours 99.9% 8.76 Hours 43.0 Min 99.95% 4.38 Hours 21.9 Min 99.99% 52 min 4.32 Min 99.999% 5 min 15 sec 25.9 Sec 99.9999% 31.5 sec 2.59 Sec 99.99999% 3 sec 0.25 Sec

Cost of Availability Levels (Gartner) The cost of continuous availability increases by multiples as the target level increases. If X represents the cost of 99% availability then 99.9% has a cost equal to 3X, and 99.999% has a cost equal to 8X Achieving Continuous Availability in the range of 99.999% (just minutes of downtime a year) is very rare and expensive.

Downtime Root Cause Outages affecting Availability Industry Breakdown Unplanned 20% - Hardware, OS, Environmental Factors 40% - Applications 40% - Operations Planned 65% - Application and Database Updates 35% - HW/SW/DC Maintenance, Systems and Applications Management Paychex Unplanned 2% - Hardware, OS, Environmental Factor 22% - Operations (network, configurations, memory, data integrity) 52% - Applications 24% - Unknown (efforts underway to minimize this classification) Contributing factors to downtime: component failure, rate of change, expertise, process & procedures.

High Level Standards for Achieving Increased Availability Infrastructure and Data Resiliency Design and Implement infrastructure that prevents unplanned downtime. Use fault-tolerant hardware and operating systems. Provide redundancy to account for component failures as well as disasters. Limit complexity. Reduce MTTF (Mean Time to Failure). Automation Avoid human intervention for failovers as well as maintenance, test and release procedures. Use hardware load balancers or clustering software to automate failovers. Use automated release and testing tools and procedures. Reduce MTTR (Mean Time to Recovery). IT management processes Improve the maturity level of standard IT management processes (proactive monitoring, change, release, configuration and incident/problem). Reduce amount of change and/or probability that change will cause downtime. Monitor failures and capacity proactively vs. reactively. Understand integrations and maintain consistent configurations. Formally report and investigate incidents affecting availability. Per Gartner, IT Maturity Levels of 3 or above are required in order to be highly available. Application Development Design Applications with availability in mind (non-disruptive installation, no hard-coded limits, dynamic initialization, backwards compatibility) to support minimizing planned downtime (i.e. rolling upgrades. Consider availability requirements early in the design process. Once 1/3 of code has changed, Gartner recommends a complete application rewrite. Testing Test all possible use cases in representative test environments. Complete integrated, performance, and post-release testing.

Strategic HA and CO Technologies (Currently Available) Load Balancing Session State Management Hardware Clustering Database/Application Clustering Fault-Tolerant Hardware/Self-Healing OS Data Replication Data Recovery Proactive Monitoring Incident Tracking and Reporting Automation Configuration Management Development Lifecycle Design Principles and Standards Continuous Operations F5 Devices for Load Balancing active/active, redundant configurations and automated failovers Coherence*Web for session state management to support active/active configurations Veritas and Microsoft Hardware clustering software to mitigate hardware and OS failures Oracle RAC for multi-node database clustering to mitigate hardware or software failures on database server nodes We use hardware models and Operating Systems that have features to prevent downtime from occurring due to hardware/os failures. Dataguard for database replication and SRDF for storage-based date replication to allow for hardware recovery on standby systems Backups to disk and tape for data recovery. Snapshot and snapback technologies. RUM, Aternity and Edgesight for End User Experience Monitoring. OEM for DB monitoring. SCOM and OVO for hardware/os/app monitoring. Service Center is currently in used for Problem Management. Tidal for automated releases. F5/SM & WL plugins for failovers. Service Center is currently in used for Configuration Management. Quality Center for development lifecycle quality and consistency Troux for architecture principles and standards for quality and consistency of designs. Hardware Clustering allows for rolling HW/OS patching. Some hardware has hot-swappable components. = Relatively New to Paychex

Strategic HA and CO Technologies (Investigative Stage) Load Balancing Session State Management Hardware Clustering Database/Application Clustering Fault-Tolerant HW/Self-Healing OS Data Replication Data Recovery Proactive Monitoring All Feature/functions of the F5 devices (LTMs/GTMs) that can enhance availability No new technology investigations at this time. No new technology investigations at this time. Oracle SOA Suite 11g to allow for implementation across datacenters Use of Superdome technologies. Active Dataguard for data replication and read-only access to standby. Coherence for data caching and replication. Volume snapshots for data recovery. Central monitoring console, extend use of existing tools for proactive vs. reactive monitoring Incident Tracking and Reporting Automation Configuration Management Development Lifecycle Design Principles and Standards Virtualization Continuous Operations Event Correlation tools, enhanced service level reporting tools Technologies to: provide automated post-release test to decrease planned downtime windows, further automate release processes and build/turnover processes Is Service Center adequate for mapping application and system interdependencies? Technologies to deliver self-healing applications with rolling upgrade capabilites. No new technology investigations at this time. Investigate role in providing HA and Continuous Operations solutions. Transient Logical Standby procedures in Oracle 11g to apply database patches in a rolling fashion

HA and CO Design Considerations and Procedures Redundant Infrastructure Load Balancing Session State Management Proactive Monitoring Incident Tracking and Reporting Automation Configuration Management Development Lifecycle Design Principles and Standards Service Level Agreements/BIAs Testing Environments Storage, Network, Security, Server infrastructure must have redundant components. Load Balance Across data centers when appropriate. Load Balance based on performance metrics of target vs. using round robin approach. Maintain active/active configurations by utilizing Session State tools. Build health monitors into applications to be utilized by F5 devices. Continue process toward making Proactive Monitoring part of the design process not just implementation. Capture all Availability affecting incidents in Service Center while reducing the number incidents classified with unknown root cause. Further Automate failover and recovery procedures, post-release testing and build processes to reduce time and errors. Ensure standby and primary systems have matching configurations. Thoroughly map application and system interdependencies/integrations. Matching availability of Build/Release automation tools with Test Engineering requirements Analyze Agile Development Model for impacts across all areas (support, release management, testing) Architect applications with high availability features to allow for more seamless failovers and rolling upgrades as well as self-healing features. Consider rewriting applications that have been over-updated. Determine best balance between frequency and complexity of release. Develop Availability Design standards Define accurate and comprehensive SLAs and availability requirements early in the project lifecycle process in order to affect approaches and designs. Define conditional SLAs for impaired vs. unavailable. Assess impacts for downtime on build automation and testing tools on testing timelines and completeness. Enhance integrated testing to make it more comprehensive. Design productionlike test environments. Refine resource scheduling processes to make the most of the available test environments.

Predictive Model What is it? Model used to predict the overall platform availability based on probabilities of unplanned downtime. What does it provide? Allows for calculating the impact of component changes to overall availability. Can be used to help create Availability roadmap. Provides consistent way to assign availability expectations to given design as well as identify gaps. How is it used? Prediction is attained through industry failure rates and Paychex history. Continuous feedback from actual experience used to update predictions. When is it used? This model can be used during the Solutions Approach and Design Phases to assess the predicted availability percentage of given Solution/Design. It can also be used as part of an assessment of an existing Product. Work is being done to add it to the project lifecycle as a deliverable.

Predictive Model Sample