BUSINESS CONTINUITY PLAN Document Number: 100-P-01 v1.4


Table of Contents

REVISION HISTORY
PURPOSE
POLICY
DR COMMITTEE
POLICY MANAGEMENT
EMPLOYEE TRAINING AND AWARENESS
TESTING AND PREPARATION
DATA BACKUP PLAN
CUSTOMER DATA BACKUP
CORPORATE DATA BACKUP
DATA BACKUP VERIFICATION
PERSONNEL MANAGEMENT
BUSINESS DISRUPTION RESPONSE PLAN
IDENTIFICATION
DECLARATION
CATEGORIZATION
ESTABLISH COMMAND CENTER
COMMUNICATION METHODS
POST MORTEM
CATASTROPHIC SITE FAILURE PLAN
EQUIPMENT FAILOVER PROCEDURES
FAILBACK PROCEDURES

Revision History

Update                               | Version | Date       | Author
Initial Plan                         | 1.0     | 1/4/2013   | Eric Hester
Add 2014 committee updates           | 1.1     | 2/9/2014   | Eric Hester
Add Testing Timeline                 | 1.2     | 10/7/2014  | Eric Hester
Add Document number                  | 1.2     | 10/13/2014 | Joe Phillips
Add Outage Response process          | 1.2     | 1/30/2015  | Joe Phillips
Revised Document number/location     | 1.3     | 5/16/2015  | Joe Phillips
Add 2015 DR committee updates        | 1.3     | 5/20/2015  | Joe Phillips
Update Site Failover Plan            | 1.3     | 8/6/2015   | Joe Phillips, Jonathan Nalley, Eric Hester
Updates from Annual review           | 1.4     | 4/4/2016   | Joe Phillips
Update to DR Committee members' titles | 1.4   | 8/30/2016  | Joe Phillips

Purpose

The intent of this plan is to define a business continuity framework for Green Cloud Technologies that limits the effect of service disruptions on both the company and its customers. The plan described herein is a working plan that evolves over time through testing and experience, and therefore requires continual review. An essential mechanism of the framework is the practice of consistent testing and subsequent revision of the working plan.

Policy

DR Committee

For the purposes of oversight, a DR committee is formed consisting of the following:
- CTO
- COO
- Director of Business Operations
- Manager of Network Operations

This committee will convene no less than once a year to formally review the documented policies, processes, and information in this document and update them appropriately based on experience gained through operation of the business in the previous year. Any updates recommended between such meetings will require approval of the entire committee via email. The committee is also responsible for managing ongoing testing of the outlined plan, performed on a recurring basis and scheduled at the committee's discretion.

Policy Management

The most recent electronic version of this document must be maintained in the following locations:

- In the Documentation Library, on the company file server
- On the internal corporate WIKI

Printed copies of the most recent version should always be on hand at the following locations:
- CTO's office and home
- NOC Manager's office and home
- Network Operations Center
- All data centers

Any revisions to this document require the approval of the DR committee before formal updates can be made and distributed. Once approved by the DR committee, any revision made to this document requires that the revision history table be updated appropriately and that all of the locations outlined above receive the new iteration of the document.

Employee Training and Awareness

All employees will be required to sign an acknowledgement of this document (via the Security Policy Acknowledgement form). All departments must perform business continuity training at new employee on-boarding and as policy updates are made available. The DR committee is responsible for confirming that all training is accurate and complete.

Testing and Preparation

It is imperative that the processes outlined in this document are routinely tested under conditions as close to real-world as possible to ensure the best possible outcome when a real catastrophe strikes. Routine component-level redundancy testing and preventative maintenance are the key functions for avoiding many catastrophic failure scenarios. Each of these processes has separate process documentation, but together they are critical to the success of business continuity efforts and must be made a priority of the Operations team. Individual component-level testing and elemental preventative maintenance are conducted routinely each quarter, with reasonable effort made to test the recovery processes for catastrophic equipment and network failure on a yearly basis. The annual tests must, at a minimum, validate the overall process without causing an actual customer-impacting failure.

Data Backup Plan

Customer Data Backup

Green Cloud customer data is housed on an enterprise-class SAN/NAS infrastructure. Backups occur using snapshots and are stored and recoverable in accordance with each customer's individual service agreement, based on the selected IaaS service option entitled Storage Profile.

NOTE: DRaaS ExpressRestore customers and Desktop as a Service (DaaS) customers are afforded the same restore capabilities as the Local storage profile.

A snapshot is a point-in-time backup of data saved to disk. In a snapshot, only the changed blocks from the current image are saved, preventing the need to store the entire file system at the time of each backup. This allows for an instant rollback to a previous point in time, which for Green Cloud is typically between 12 a.m. (midnight) and 2 a.m. Coordinated Universal Time (UTC) each calendar day.

"Local" - A snapshot of the virtual server image will be created and saved automatically on a daily basis to the local Storage Area Network (SAN). Each daily snapshot is archived by default for seven (7) calendar days on the same storage platform on which the virtual server resides.

"Offsite Backup" - A snapshot of the virtual server image will be created and saved automatically on a daily basis to the local SAN. Each daily snapshot is archived by default for seven (7) calendar days on the same storage platform, and additionally to an alternative, geographically disparate SAN. The secondary SAN will have no I/O guarantee, and recovery can occur by moving the snapshot back to the primary SAN when it becomes available.

"24 Hour" - A snapshot of the virtual server image will be created and saved automatically on a daily basis to the local SAN. Each daily snapshot is archived by default for seven (7) calendar days on the same storage platform, and additionally to an alternative, geographically disparate SAN. The secondary SAN will have the same I/O guarantee (storage profile type) as the primary SAN. Should the primary SAN become unavailable, the server image(s) can be restored at the secondary location within twenty-four (24) hours.

"6 Hour" - A snapshot of the virtual server image will be created and saved automatically at least every six (6) hours to the local SAN. Each snapshot is archived by default for seven (7) calendar days on the same storage platform, and additionally to an alternative, geographically disparate SAN. The secondary SAN will have the same I/O guarantee (storage profile type) as the primary SAN. Should the primary SAN become unavailable, the server image(s) can be restored at the secondary location within six (6) hours.

Recovery Option    | Snapshot Destination(s) | Restore Destination(s) | RTO    | RPO     | Local SAN Retention
Local Only         | Local SAN               | Local SAN              | None   | 24 Hrs  | 7 Days
Offsite            | Local & Offsite         | Local SAN              | None   | 24 Hrs  | 7 Days
24 Hour            | Local & Offsite         | Local or Offsite       | 24 Hrs | 24 Hrs  | 7 Days
6 Hour             | Local & Offsite         | Local or Offsite       | 6 Hrs  | 6 Hrs   | 7 Days
w/ ExpressRestore  | Local                   | Local                  | 1 Hr   | <15 Min | 7 Days
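For illustration only, the storage profile options and the recovery-options table above can be captured as a small configuration structure, for example to drive reporting or restore tooling. The sketch below is not part of the documented service; the StorageProfile type and its field names are assumptions, and the values simply restate the table.

    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class StorageProfile:
        """One row of the recovery-options table (illustrative only)."""
        name: str
        snapshot_destinations: Tuple[str, ...]   # where snapshots are written
        restore_destinations: Tuple[str, ...]    # where a restore may occur
        rto_hours: Optional[float]               # None = no restore-time guarantee
        rpo_hours: float                         # recovery point objective
        retention_days: int = 7                  # local SAN retention

    STORAGE_PROFILES = [
        StorageProfile("Local Only", ("local",), ("local",), None, 24),
        StorageProfile("Offsite", ("local", "offsite"), ("local",), None, 24),
        StorageProfile("24 Hour", ("local", "offsite"), ("local", "offsite"), 24, 24),
        StorageProfile("6 Hour", ("local", "offsite"), ("local", "offsite"), 6, 6),
        StorageProfile("w/ ExpressRestore", ("local",), ("local",), 1, 0.25),
    ]

    # Example: list the profiles that allow restores at the offsite SAN.
    offsite_capable = [p.name for p in STORAGE_PROFILES if "offsite" in p.restore_destinations]
    print(offsite_capable)  # -> ['24 Hour', '6 Hour']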

Corporate Data Backup

Green Cloud's critical corporate data is housed on an enterprise-class SAN/NAS infrastructure in dedicated clusters, logically and physically separate from customer data. The corporate data backup plan is twofold, accounting for both critical and non-critical information and systems. Critical information and systems are those that support the production environment necessary for customer and employee services, such as management and monitoring appliances, authentication servers, and databases. All other systems and data are considered non-critical.

For non-critical data, backups occur using snapshots which are stored on the same local storage platform where the particular virtual server resides and are additionally copied to an alternative, geographically disparate SAN environment. Snapshots are taken at least every six (6) hours. The secondary SAN has the same I/O guarantee (storage profile type) as the primary SAN. Should the primary SAN become unavailable, the server image(s) can be restored at the secondary location within six (6) hours.

For critical data, backups occur using snapshots at the local SAN level in the same fashion as non-critical data. Additionally, disaster recovery software continually replicates backups to an alternative, geographically disparate SAN environment. Should a business disruption occur on a critical system, the disaster recovery software can be leveraged to restore systems and applications to the secondary SAN within one (1) hour, with restore points as recent as fifteen (15) minutes.

Any data stored locally on desktops or laptops is not backed up to the file server(s) and therefore is not recoverable. Corporate security and information management policy states that all corporate data resides on the corporate file servers and is not to be downloaded, transferred, or moved elsewhere.

Data Backup Verification

Since snapshots are based on the previous state of actual blocks of the active file system, they are inherently verified. The replication of data between data centers is verified via checksums of each data block transferred (see the illustrative sketch below). Any failure of the snapshot or replication process is monitored and resolved by network operations. As part of routine redundancy and failover testing, a random sampling of VMs should be brought online from snapshots, both locally and from replicated data, to ensure no corruption has occurred.

Personnel Management

The safety of personnel during a crisis of any kind is paramount. To this end, all staff shall be equipped with the means necessary to continue to work from the safest location possible. Personal computers or company-provided laptops should be configured with these minimums:
- Telephony soft-client with pre-built credentials
- USB headset
- Preconfigured VPN access
- Desktop as a Service (DaaS) client

With remote access credentials preconfigured, the outage leader can ensure that the physical work locations of all employees are safe and considerate of travel conditions.
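The checksum-based verification described under Data Backup Verification above can be pictured with a minimal sketch. This is an illustration only, assuming hypothetical in-memory block lists rather than the SAN platform's actual replication mechanism; the function name verify_replicated_blocks is an assumption.

    import hashlib

    def verify_replicated_blocks(source_blocks, replica_blocks):
        """Compare per-block SHA-256 checksums of a source volume and its replica.

        Both arguments are iterables of bytes objects (one per block). Returns the
        indices of blocks whose checksums do not match, which would be escalated
        to network operations for remediation.
        """
        mismatches = []
        for index, (src, dst) in enumerate(zip(source_blocks, replica_blocks)):
            if hashlib.sha256(src).digest() != hashlib.sha256(dst).digest():
                mismatches.append(index)
        return mismatches

    # Toy example: block 1 was corrupted in transit.
    source = [b"block-0", b"block-1", b"block-2"]
    replica = [b"block-0", b"corrupt", b"block-2"]
    print(verify_replicated_blocks(source, replica))  # -> [1]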

For inclement weather or catastrophic scenarios involving the Corporate (Greenville) office, Operations management will determine if travel is unsafe and may instruct scheduled staff to work from home or an alternative safe location. For catastrophic scenarios involving the data center(s), but not the Corporate office, technical staff is expected to convene at the Corporate office.

During a critical event, abnormal working hours are to be expected. Operations management will schedule employees' shifts to ensure continued productivity while preventing fatigue. If an unscheduled event requires staff to work beyond their normally scheduled days or times, the outage leader will provide a shift schedule to staff and management.

Business Disruption Response Plan

Identification

When a potential outage is discovered through network, hardware, or software alarms, or by repeated customer trouble reports in a short timeframe, the identifying personnel will immediately notify a member of the DR committee via telephone and email in the following escalating order:
- NOC Manager
- Operations Management
- CTO
- COO

Updated contact information for the above personnel is available in the Employee Directory. All potential outage communications will include a short description of the scenario discovered, an estimate of the number of customers possibly impacted, and any patterns or common elements discovered in initial triage. These actions and relevant data are to be recorded in an Incident ticket, per the Incident Management process. This ticket, referred to from here on as the "master" ticket for the issue, will have all related customer-reported Incidents linked to it. The master ticket will be updated as trouble isolation occurs and the level of impact changes during the discovery phase.

Once a member of management has been reached, they will designate the outage leader and begin the process to assess and determine the scope of the outage and the appropriate action per the process described below. In the unlikely scenario that none of the above communication methods are functional (e.g. a catastrophic natural event), any available Operations and Engineering personnel are to converge on a designated command center to manage the event and subsequent customer communications to a reasonable degree, with physical safety being the primary concern.

Declaration

At the discretion of the designated leader, an outage will be declared and at that time assigned one of the following color-coded severity levels. Based on the severity/category assigned to the event, internal and external communications are to follow immediately.

Categorization

The categorization schema for business disruptions is based on two factors: the scope of the issue from an infrastructure perspective and the volume of end-users impacted by the event. The severity level determines the internal and external communication methods and the processes necessary to meet service level agreements, and supports a continual improvement model.

White - The definition of a white level event: impact is limited to a small percentage of total customers; or the issue is isolated to a single network or equipment failure and can be resolved in a short timeframe; or there is less than one (1) hour of anticipated downtime.

Yellow - The definition of a yellow level event: impact is limited to less than one-half of the total customer base; or the issue is isolated to a particular service (e.g. Hosted PBX, DRaaS, IaaS); or there is an undetermined estimated time to repair (ETR) but a known root cause.

Orange - The definition of an orange level event: impact is to more than one-half of the total customer base; or the issue is not isolated to a particular service (e.g. network, compute, etc.); or there is an undetermined ETR and an unknown root cause.

Red - The definition of a red level event: impact is to more than three-quarters of the total customer base; or the issue results in the customer's inability to contact Green Cloud (e.g. email, web, and telephone services are down).

Customers Impacted | Scope / Services Impacted                 | Severity Level
Low                | Isolated element                          | White
< 50%              | Single service (e.g. PBX, DRaaS)          | Yellow
> 50%              | Multiple services (e.g. network, compute) | Orange
> 75%              | All Green Cloud communications            | Red
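As a restatement of the categorization table above in executable form (a sketch only, not part of the formal plan), severity assignment can be expressed as a short rule evaluation. The function name, its parameters, and the 5% placeholder for a White event's "small percentage" are assumptions.

    # "Small percentage" for a White event is not quantified in the plan;
    # 5% is used here purely as a placeholder threshold.
    WHITE_THRESHOLD = 0.05

    def classify_severity(pct_customers_impacted: float,
                          services_impacted: int,
                          all_communications_down: bool = False) -> str:
        """Map outage scope to the color-coded severity levels defined above.

        pct_customers_impacted: fraction of the customer base affected (0.0-1.0).
        services_impacted: number of distinct services involved (1 = isolated).
        all_communications_down: True if customers cannot reach Green Cloud at all.
        """
        if all_communications_down or pct_customers_impacted > 0.75:
            return "Red"
        if pct_customers_impacted > 0.50 or services_impacted > 1:
            return "Orange"
        if pct_customers_impacted <= WHITE_THRESHOLD and services_impacted <= 1:
            return "White"
        return "Yellow"

    print(classify_severity(0.30, 1))  # -> 'Yellow'
    print(classify_severity(0.10, 3))  # -> 'Orange'
    print(classify_severity(0.02, 1))  # -> 'White'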

Establish Command Center

Once an outage has been declared, an internal command center will be established for technical personnel communication, and if necessary personnel may need to convene at the most appropriate rendezvous location. The establishment of a physical command center will be made at the discretion of the outage leader based on conditions such as the scope of the issue, inclement weather, availability of internet and telephone connectivity, and the safety of personnel. The pre-designated locations have been identified as:
- Corporate Office (Greenville) Conference Room
- Greenville Data Center War Room
- Nashville Data Center Conference Room

Communication Methods

A declaration email will be delivered internally to all Green Cloud employees indicating the severity level and initial facts about the disruption. As the outage is further identified, the severity level may change. Depending on the severity level, initial notification will be made via the Green Cloud public Operational Status web page: http://status.grncld.net. Green Cloud recommends that all end-users, partners, vendors, and employees subscribe to updates via the Operational Status web page. Updates are currently available via email and/or SMS (text message).

Both internal and external updates will be provided at the pre-designated time intervals, depending on the severity outlined in the chart below, via approved communication methods. Regular updates are the responsibility of the outage leader.

Communication                       | WHITE       | YELLOW      | ORANGE      | RED
Update status.grncld.net?           | Y           | Y           | Y           | Y
Updates Method                      | Status Page | Status Page | Status Page | Status Page
Updates Frequency (minimum)         | None        | 2 hrs       | 1 hr        | 30 min
Management Follow Up                | N           | N           | Y           | Y
Provide Reason for Outage document  | N           | N           | Y           | Y
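The minimum update cadence from the communication table above can be restated as a small lookup, shown here purely as an illustrative sketch; the dictionary name and helper function are assumptions, not tooling referenced by this plan.

    from datetime import datetime, timedelta
    from typing import Optional

    # Minimum status-page update cadence per severity level, from the table above.
    # None means no recurring updates are required for White events.
    UPDATE_INTERVAL = {
        "White": None,
        "Yellow": timedelta(hours=2),
        "Orange": timedelta(hours=1),
        "Red": timedelta(minutes=30),
    }

    def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
        """Return when the next status-page update is due, or None for White events."""
        interval = UPDATE_INTERVAL[severity]
        return last_update + interval if interval else None

    print(next_update_due("Red", datetime(2016, 8, 30, 14, 0)))  # -> 2016-08-30 14:30:00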

Post Mortem

Upon conclusion of an event, it is the responsibility of Operations Management to provide a Reason For Outage (RFO) document detailing the outage, its root cause, and the actions taken, to be shared with customers who request it. The RFO will be available to all employees on the corporate file server. For Orange and Red level events, each member of management will be assigned a portion of the impacted customer list to personally contact, provide the RFO, and manage the delegation of any technical follow-up issues. Any lessons learned, customer and partner feedback, and corrective actions intended to prevent recurrence of the disruption (if avoidable) will be provided to the DR Committee at this time to allow for continual improvement.

Catastrophic Site Failure Plan

Equipment Failover Procedures

In the unlikely event an entire Green Cloud data center facility is lost to catastrophic failure of all or a majority of its infrastructure, failover to the data replication site for the failed data center must be evaluated. The determination to initiate failover will be made by the outage leader, CTO, and all available members of senior management. This decision will be based on the severity of the damage to the data center facility itself and/or its critical infrastructure, such as Green Cloud equipment, power, network, and cooling. If the team determines that the restoration time at the impacted facility is greater than the time required to fail over, then the failover will commence.

Failover steps shall be outlined in detail in the respective Engineering and Operations standard operating procedures; however, the overall process may require one or more of the following:

1. Site Survey and Site Shutdown - Any remaining operational equipment in the failed site will be shut down completely to avoid conflict with services enabled at the failover site.
   a. Survey team arrives on-site to assess the severity level and establish visual confirmation of the event
   b. All equipment is powered off and physically removed from PDUs
   c. Note: This step can and may be completed in parallel with network failover
2. Network Failover
   a. Fail over the firewall configuration (Cisco ASA Security Context)
   b. Establish vCloud External network gateways as sub-interfaces on Po1 on the ASR
   c. Where possible, move private carrier interconnect circuits to the failed site
3. Storage Failover - All replicated volumes at the failover site will be brought online as active storage, following the vendor's and/or internal procedures.
   a. Change ds* DNS records
   b. Bring cloned/snapshot/synthetic VMs online (Tintri) and the snapshot copy online on the e-series
4. Internal Compute Management Failover - All replicated VMware infrastructure, such as vCenter, vShield Manager, and vCloud Director, will be brought online.
5. Customer Compute Failover - All failover compute resources will be brought online on the failover VLANs.
   a. Take failover hosts out of maintenance mode
6. Customer VM Failover - All customer VMs which have been replicated will be brought online following the priorities set forth by their designated Recovery Time Objective (RTO), i.e. 6-hour RTO customers first, then 24-hour RTO customers, and so on.
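The Customer VM Failover step above prioritizes power-on by each VM's designated RTO. Purely as an illustration (the orchestration tooling is not specified in this plan), that ordering could look like the following; the ReplicatedVM record and its fields are assumptions.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ReplicatedVM:
        name: str
        rto_hours: int   # designated Recovery Time Objective, e.g. 6 or 24

    def failover_order(vms: List[ReplicatedVM]) -> List[ReplicatedVM]:
        """Order replicated customer VMs for power-on at the failover site.

        VMs with the most aggressive RTO (fewest hours) are brought online first,
        per the Customer VM Failover step above.
        """
        return sorted(vms, key=lambda vm: vm.rto_hours)

    # Example: 6-hour RTO customers come before 24-hour RTO customers.
    vms = [ReplicatedVM("cust-a-web", 24), ReplicatedVM("cust-b-db", 6)]
    print([vm.name for vm in failover_order(vms)])  # -> ['cust-b-db', 'cust-a-web']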

Failback Procedures

Once the failed facility has been fully repaired and cleared for use by both the data center provider and Green Cloud Engineering, a determination may be made to return services to the vacant facility. In many cases, it is preferable to leave restored/recovered VMs running at the secondary site for an extended period of time. A failback procedure occurs in much the same way as a failover, but in the opposite direction; because of this, downtime is to be expected. Engineering and Sr. Management will make the decision as to whether, when, and how to perform such an operation.