IBM TS7700 Series Grid Failover Scenarios Version 1.4


July 2016

IBM TS7700 Series Grid Failover Scenarios, Version 1.4

TS7700 Development Team: Katsuyoshi Katori, Kohichi Masuda, Takeshi Nohta
Tokyo Lab, Japan System and Technology Lab

Copyright 2006, 2013-2016 IBM Corporation

Table of Contents

Introduction
Summary of Changes
Test Configuration
Test Job Mix
TS7700 Grid Failure Mode Principles
Autonomic Ownership Takeover Manager

Part I: Failover scenarios for 2-Way clusters Grid configuration
  Failure of a Host Link to a TS7700
  Failure of all Host Links to a TS7700
  Failure of One Link Between TS7700s
  Failure of Both Links Between TS7700s w/Local Mounts Only
  Failure of Both Links Between TS7700s w/Remote Mounts
  Failure of Both Links Between TS7700s and Ownership Transfer
  Failure of one Host Link to the Remote TS7700
  Failure of all Host Links to the Remote TS7700
  Failure of the Local TS7700
  Failure of the Remote TS7700
  Failure of Both Links Between TS7700s w/Autonomic Ownership Takeover
  Failure of the Local TS7700 w/Autonomic Ownership Takeover for Read
  Failure of the Local TS7700 w/Autonomic Ownership Takeover for Write
  Failure of All Links between Sites w/Autonomic Ownership Takeover
  Failure of Gb Links and One TSSC w/Autonomic Ownership Takeover

Part II: Failover scenarios for 3-Way clusters Grid configuration
  Failure of a link between cluster0 and Grid network
  Failure of both links between cluster0 and Grid network
  Failure of a link between cluster0 and Grid network w/Remote Mounts
  Failure of both links between cluster1 and Grid network
  Failure of cluster0 in three clusters Grid
  Failure of a link between cluster2 and Grid network
  Failure of both links between cluster2 and Grid network
  Failure of the remote TS7700
  Failure of one local TS7700D in Hybrid 3-Way clusters Grid
  Failure of both links between cluster2 and Grid network in Hybrid 3-Way clusters Grid
  Failure of remote TS7700 in Hybrid 3-Way clusters Grid (1)
  Failure of remote TS7700 in Hybrid 3-Way clusters Grid (2)
  Failure of all grid links between local site and the Grid network
  Failure of both local TS7700s
  Failure of whole local site

Part III: Failover scenarios for 4-Way clusters Grid configuration
  Failure of one local TS7700 in four clusters Grid
  Failure of one local TS7700 w/ partitioned workload
  Remote sites failure in Hybrid 4-Way clusters Grid w/ partitioned workload
  Failure of local TS7700D in Hybrid 4-Way clusters Grid
  Failure of local TS7700 in Hybrid 4-Way clusters Grid

Part IV: Failover scenarios for 3-Way clusters Grid configuration w/Synchronous Mode Copy
  Introduction of Synchronous Mode Copy
  Failure of one local cluster in 3-Way clusters Grid with sync mode copy enabled (both sync copy clusters are local)
  Failure of one local cluster in 3-Way clusters Grid with sync mode copy enabled
  Failure of one local cluster in 3-Way clusters Grid with sync mode copy enabled (synchronous deferred option enabled)
  Failure of one local cluster in 3-Way clusters Grid with sync mode copy enabled using Dual Open On Private Mount option (1)
  Failure of one local cluster in 3-Way clusters Grid with sync mode copy enabled using Dual Open On Private Mount option (2)

Introduction

The IBM TS7700 Series is the latest in the line of tape virtualization products that have revolutionized the way mainframe customers utilize their tape resources. The capability to resume business operations in the event of a product or site failure is provided by the TS7700 Grid configuration. In a Grid configuration, up to six TS7700 clusters are interconnected and can replicate data created between any of the clusters in the configuration.

As part of a total systems design, business continuity procedures must be developed to instruct I/T personnel in the actions that need to be taken in the event of a failure. Testing of those procedures should be performed either during initial installation of the system or at some interval. This paper was written to assist IBM specialists and customers in developing such testing plans, as well as to better understand how the TS7700 will respond to certain failure conditions.

The paper documents a series of TS7700 Grid failover scenarios for z/OS which were run in an IBM laboratory environment. Single failures of all major components and communication links, and some multiple failures, are simulated. For each of the scenarios, the z/OS console messages that are typically presented are indicated (depending on how the FICON channels are configured between the host and the TS7700, some of the messages may not be generated). Obviously, not all possible failover situations could be covered. The focus of this paper is on those which demonstrate the critical hardware and microcode failover capabilities of the TS7700 Grid configuration. It is assumed throughout this white paper that the reader is familiar with using virtual tape systems attached to z/OS environments.

Throughout this document, TS7700 is a generic term that refers to the latest model and its architecture, much like VTS was used to describe the prior generation. At code level 8.40.x.x, the new model TS7760 is introduced. The following model notations are used throughout this white paper:

- TS7740: V06 and V07
- TS7720: VEA and VEB with no tape library
- TS7720T: VEB with tape library
- TS7760: VEC with no tape library
- TS7760T: VEC with tape library
- TS7700: all models (TS7740, TS7720, TS7720T, TS7760 and TS7760T) are included
- TS7700D: the TS7700 disk-only models (TS7720 and TS7760) are included
- TS7700T: the TS7700 tape attach models (TS7720T and TS7760T) are included (the TS7740 is NOT included in this term in this white paper)
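The notation above maps onto a small lookup table. The sketch below is only a convenience for readers who script against this terminology; the dictionary layout is illustrative and is not part of any TS7700 interface.

```python
# Illustrative lookup of the model notation above; a convenience sketch for
# site scripts or documentation, not an IBM-provided structure.
MODEL_NOTATION = {
    "TS7740":  {"hardware": ("V06", "V07"), "tape_attach": True},
    "TS7720":  {"hardware": ("VEA", "VEB"), "tape_attach": False},
    "TS7720T": {"hardware": ("VEB",),       "tape_attach": True},
    "TS7760":  {"hardware": ("VEC",),       "tape_attach": False},
    "TS7760T": {"hardware": ("VEC",),       "tape_attach": True},
}

# Grouping terms used in this paper: TS7700D covers the disk-only models,
# TS7700T covers the tape attach models other than the TS7740.
TS7700D = [m for m, v in MODEL_NOTATION.items() if not v["tape_attach"]]
TS7700T = ["TS7720T", "TS7760T"]

if __name__ == "__main__":
    print("TS7700D:", TS7700D)   # ['TS7720', 'TS7760']
    print("TS7700T:", TS7700T)   # ['TS7720T', 'TS7760T']
```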

Summary of Changes

Version 1.0 - Initial version
Version 1.1
  o Minor updates to wording in the introduction section
  o Note that these tests are not part of the normal installation of the product
Version 1.2 - Added failover scenarios for three and four cluster grids
Version 1.3 (October 2013)
  o Added scenarios for Sync Mode Copy
  o Clarifications and updates throughout
Version 1.4 (July 2016)
  o Fixed typos and removed the "Virtualization Engine" name
  o Added TS7700 model descriptions (the new notation for each TS7700 model is used throughout this white paper)

Test Configuration

The hardware configurations used for the laboratory test scenarios are illustrated below. For the Autonomic Ownership Takeover scenarios, one or more TSSCs (IBM TS3000 System Consoles) attached to the TS7700s are required, as well as an Ethernet connection between the TSSCs when more than one exists. Although all the components tested were local, the results of the tests should be similar, if not the same, for remote configurations. All FICON connections were direct, but again, the results should be valid for configurations utilizing FICON directors or channel extenders. Any supported level of z/OS software and current levels of TS7700, 3953 and 3584 microcode should all provide similar results. The test environment was z/OS with JES2. Failover behaviors within the TS7700 are the same for all supported host platforms, although host messages will differ and host recovery capabilities may not be supported in all environments. Test results should also be valid for configurations using the 3494 tape library versus the latest TS3500.

One of the architectural differences between the TS7700 Grid configuration and the prior generation's PTP VTS configuration is the elimination of the Virtual Tape Controllers (VTCs). The VTCs provided three major functions: 1) management of what and how to replicate data between the VTSs, 2) determination of which VTS has a valid copy of a logical volume, and 3) selection of a VTS to handle the I/O operations for a tape mount and the routing of host I/O operations to that VTS. With the TS7700 Grid configuration, the first two functions have been integrated into each TS7700 cluster's function. For the third function, the attached host, in combination with TS7700 Device Allocation Assist and Scratch Allocation Assist, selects which TS7700 will handle the tape mount and the I/O operations associated with it.

During the laboratory tests, all virtual devices in all TS7700 clusters were online to the test host, as shown in the following figures. Scratch Allocation Assist was not enabled. For the two-cluster configuration shown in Figure 1, all host jobs are routed to the virtual device addresses associated with TS7700-0. The host connections to the virtual device addresses in TS7700-1 are used in testing recovery for a failure of TS7700-0. In the three-cluster Grid configuration shown in Figure 2, the host is connected in a balanced mode to the virtual device addresses in TS7700-0 and TS7700-1. TS7700-2 is used for testing recovery when both TS7700-0 and TS7700-1 fail. In the four-cluster configuration shown in Figure 3, the host has the logical devices for TS7700-0 and TS7700-1 online, while TS7700-2 and TS7700-3 are used for recovery.

Figure 1: Hardware configuration for the test scenarios for a two-cluster Grid (z/OS host attached to TS7700-0 and TS7700-1, interconnected through the Grid network).

Figure 2: Hardware configuration for the test scenarios for a three-cluster Grid.

Figure 3: Hardware configuration for the test scenarios for a four-cluster Grid.
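The host connectivity described above and shown in Figures 1 through 3 can be summarized as a small data structure. The sketch below is purely illustrative; the configuration names and field labels are assumptions of this example rather than TS7700 terms.

```python
# Illustrative summary of the grid test configurations described above.
# The names and the "production"/"recovery" labels are this sketch's own.
TEST_CONFIGURATIONS = {
    "two_cluster": {
        "production": ["TS7700-0"],               # all host jobs routed here
        "recovery":   ["TS7700-1"],               # used if TS7700-0 fails
    },
    "three_cluster": {
        "production": ["TS7700-0", "TS7700-1"],   # host balanced across both
        "recovery":   ["TS7700-2"],               # used if both local clusters fail
    },
    "four_cluster": {
        "production": ["TS7700-0", "TS7700-1"],
        "recovery":   ["TS7700-2", "TS7700-3"],
    },
}

if __name__ == "__main__":
    for name, cfg in TEST_CONFIGURATIONS.items():
        print(f"{name}: production={cfg['production']}, recovery={cfg['recovery']}")
```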

Note: The test outlines in this white paper are a suggestion of how a customer might test their recovery scenarios in the event of a failure in the TS7700 Grid or its related interconnections. They are not part of the installation of the TS7700, and any IBM service representative involvement is not included in the costs associated with the install.

Test Job Mix

The test jobs running during each of the failover scenarios consisted of 10 jobs which mounted single specific logical volumes for input (read) and 5 jobs which mounted single scratch logical volumes for output (write). The mix of work used in the tests was purely arbitrary; any mix would be suitable.

TS7700 Grid Failure Mode Principles

A TS7700 Grid configuration provides the following availability and data access characteristics:

- The virtual device addresses for each cluster are independent. This is different from the prior generation's PTP VTS, where the mount request was issued on a virtual device address defined for a Virtual Tape Controller, and the Virtual Tape Controller then decided which VTS to use for data access. Any mount to any device within any cluster provides access to all volumes contained within any cluster within the grid; devices simply need to be varied online to at least one cluster within a grid.

- All logical volumes are accessible through any of the virtual device addresses on the TS7700s in the Grid configuration. The preference is to access a copy of the volume in the tape volume cache associated with the TS7700 cluster on which the mount request is received. If a recall is required to place the logical volume in the tape volume cache on that cluster, it is done as part of the mount operation.

- If a copy of the logical volume is not available at the mount point TS7700 (either because it does not have a copy or the copy it does have is inaccessible due to an error), and a copy is available at another TS7700 in the Grid, the volume is accessed through the tape volume cache at the TS7700 that has the available copy. The TCP/IP Grid network infrastructure is essentially used as a channel extender, but without the FICON protocol overhead and with access to the data in compressed form. If a recall is required to place the logical volume in the tape volume cache on the alternate cluster, it is done as part of the mount operation. If a recall is required to place the volume in the cache of the cluster the mount request was received on and a peer cluster already contains a copy in cache, the TS7700 may use the Grid to access the peer version rather than waiting for a recall to complete.

Whether a copy is available at another TS7700 cluster depends on the copy consistency point that was assigned to the logical volume when it was written. The copy consistency point is set through the Management Class storage construct. It specifies if and when a copy of the data is made between the TS7700s in the Grid configuration. There are four copy consistency policies that can be assigned:

- Synchronous Mode Copy Consistency Point: As data arrives off the FICON channel, it is compressed and then simultaneously duplexed to two TS7700 clusters at the same time. Memory buffering is used in order to enable this consistency policy to operate at long distances with very attractive performance. Applications naturally harden data to tape by issuing SYNCH commands at critical points throughout a job, and the TS7700 uses this SYNCH operation to flush any buffered content and harden all data up to that point on tape at both locations. This provides a zero recovery point objective at sync point granularity, which is critical for applications such as DFSMShsm or OAM Object Support. In the event no SYNCH operations occur, one copy may lag by a few megabytes and will be synchronized implicitly during tape close processing. Any two locations may be configured as the consistency points, and the local mount point cluster is not required to be one of the two. Additional copies can be made at alternate clusters using the remaining copy policies.

- Rewind Unload Copy Consistency Point: If a data consistency point of RUN is specified, the data created on any TS7700 is copied to the one or more specified TS7700s as part of successful rewind unload command processing, meaning that for completed jobs, a copy of the volume will exist on all TS7700s configured as Synchronous and Rewind Unload. Access to data written by completed jobs (successful rewind/unload) prior to the failure is maintained through the other TS7700 cluster.

- Deferred Copy Consistency Point: If a data consistency point of Deferred is specified, the data created on any TS7700 is copied to one or more other TS7700s after successful rewind unload command processing. Access to the data through the other TS7700 cluster depends on when the copy completes and whether another cluster containing a copy is accessible. Because there will be some delay in performing the copy, access may or may not be available when a failure occurs.

- No Copy Copy Consistency Point: If a data consistency point of No Copy is specified, the data created on any TS7700 is not copied to the specified TS7700s. If these No Copy TS7700s are the only TS7700 clusters available after an outage, the data will be inaccessible until the peer TS7700 cluster or clusters containing copies are restored.
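As a way to reason about the four policies above, the sketch below models which clusters are expected to hold a consistent copy of a volume at three points in time. It is a conceptual illustration only; the single-letter shorthand (S, R, D, N) and the function name are assumptions of this sketch, not a TS7700 interface.

```python
# Conceptual model of where consistent copies of a volume exist over time,
# given a per-cluster copy policy (shorthand used here: S=Synchronous,
# R=RUN, D=Deferred, N=No Copy). Illustrative only.

def consistent_copies(policy_by_cluster, phase):
    """Return the clusters expected to hold a consistent copy of the volume.

    phase is one of:
      'during_job'  - data is still being written (after a sync point)
      'job_closed'  - rewind/unload has completed successfully
      'copies_done' - enough time has passed for deferred copies to finish
    """
    consistent = set()
    for cluster, policy in policy_by_cluster.items():
        if policy == "S":
            # Synchronous copies are hardened at each sync point, so they are
            # consistent even while the job is still running.
            consistent.add(cluster)
        elif policy == "R" and phase in ("job_closed", "copies_done"):
            # RUN copies exist once rewind/unload processing completes.
            consistent.add(cluster)
        elif policy == "D" and phase == "copies_done":
            # Deferred copies are queued after rewind/unload and may lag.
            consistent.add(cluster)
        # 'N' (No Copy) clusters never receive a copy.
    return consistent

if __name__ == "__main__":
    # Example: sync copies on cluster0/cluster1, a deferred copy on cluster2.
    policy = {"cluster0": "S", "cluster1": "S", "cluster2": "D"}
    for phase in ("during_job", "job_closed", "copies_done"):
        print(phase, sorted(consistent_copies(policy, phase)))
```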

The volume removal policy was introduced at the release 1.6 microcode level for hybrid Grid configurations. Beginning with release 1.7, it is available in any Grid configuration which contains at least one TS7700D cluster. Since the TS7700 disk-only solution has a maximum storage capacity that is the size of its tape volume cache, this policy allows logical volumes to be automatically removed from cache once the cache fills, while a copy is retained within one or more peer clusters in the Grid. When the auto removal starts, all volumes in the fast-ready (scratch) category are removed first, since these volumes are intended to hold temporary data. This mechanism can also remove old volumes in a private category from the cache to meet a pre-defined cache usage threshold, as long as a copy of the volume is retained on one of the peer clusters. A TS7700 cluster failure could affect the availability of old volumes if the cluster which removed the volume is the only one remaining.

The TS7700 Grid architecture allows equal access to any volume within a grid from any cluster within the grid. The shared access of a particular volume is achieved through a dynamic volume ownership protocol. At any point in time, a logical volume is owned by one cluster. The owning cluster has control over access to the volume and over changes to the attributes associated with the volume (such as category or constructs). The cluster that has ownership of a logical volume can change dynamically, based on which cluster in the Grid configuration is requesting a mount of the volume. When a mount request is received on a virtual device address, the TS7700 cluster for that virtual device must have ownership of the volume to be mounted or must obtain the ownership from the cluster that currently owns it. If the TS7700 clusters in a Grid configuration and the communication paths between them are operational, the change of ownership and the processing of logical volume related commands are transparent with regard to the operation of the TS7700.

However, if a TS7700 cluster that owns a volume is unable to respond to requests from other clusters, the operation against that volume will fail unless some additional direction is given. In other words, clusters will not automatically assume or take over ownership of a logical volume without being directed. This additional action is required to prevent invalid ownership acquisitions due to network-only failures where both clusters are still operational. If more than one cluster were to hold ownership of a volume independently, the volume's data or attributes could be changed on each cluster. If a TS7700 cluster has failed or is known to be unavailable (for example, it is being serviced), its ownership of logical volumes needs to be transferred to the other TS7700 cluster with one of the following modes, which can be set through the management interface.

- Read-Only Ownership Takeover: When Read-Only Ownership Takeover (ROT) is enabled for a failed cluster, ownership of a volume is allowed to be taken from the failed TS7700 cluster when a volume is accessed by a host operation. Only read access to the volume is allowed through the other TS7700 clusters in the Grid. Once ownership for a volume has been taken in this mode, any operation attempting to modify data on that volume or change its attributes is failed. The mode for the failed cluster remains in place until a different mode is selected or the failed cluster has been restored. Any volumes accessed during the outage which were taken over in this mode are reconciled once the original owner returns and all clusters are made aware of the final owner. In the event a volume was accessed and modified during the outage by the original owner (network outage only), no error event occurs, given the temporary owner only had read access.

- Write Ownership Takeover: When Write Ownership Takeover (WOT) is enabled for a failed cluster, ownership of a volume is allowed to be taken from the failed TS7700 cluster when a volume is accessed by a host operation, and full access is allowed through the other TS7700 clusters in the Grid. The mode for the failed cluster remains in place until a different mode is selected or the failed cluster has been restored. Any volumes accessed during the outage which were taken over in this mode are reconciled once the original owner returns and all clusters are made aware of the final owner and the latest properties and volume data. Replications are queued if data changed during the outage. In the event a volume was accessed and modified during the outage (network outage only) by the original owner and the temporary owner also modified the volume, the volume is moved into an error state where manual intervention is required to choose the most valid version. Autonomic ownership takeover is designed to prevent such takeover enablement. Safety checks in manual enablement also prevent such a condition if the TS7700 and infrastructure believe only a network outage exists.

- Service Ownership Takeover: When a TS7700 cluster is placed in service mode, the TS7700 Grid automatically enables Write Ownership Takeover mode against the serviced cluster. Though the result is identical to WOT, it is given a unique name to differentiate why it was enabled. This mode is not explicitly enabled through the management interface but is implicitly enabled by initiating the service preparation process.

Autonomic Ownership Takeover Manager

In addition to the manual setting of one of the ownership takeover modes, an optional automatic method (Autonomic Ownership Takeover Manager, or AOTM) is available when each of the TS7700s is attached to a TSSC. Whether this function is enabled and how it operates is configurable by an IBM SSR and by the customer through the management interface. If a TS7700 detects that a remote TS7700 has failed, a check is made through the TSSCs to determine whether the owning TS7700 is inoperable or just the communication paths to it are not functioning. When distance exists between the two communicating clusters, independent TSSCs which are local to the distant clusters are recommended. The TSSCs are then inter-connected through TCP/IP, which provides an alternate method of verifying whether remote clusters are inoperable or only a network outage exists. If the TSSC or TSSCs have determined that the owning TS7700 is inoperable, either read or write ownership takeover is enabled, depending on what was set in the enablement options.
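The sketch below restates the ownership rules above as simple decision logic: an operation against a volume owned by an unreachable cluster fails unless a takeover mode is in effect, Read-Only Ownership Takeover permits only read access, and Write (or Service) Ownership Takeover permits full access. The function, its arguments, and the returned strings are assumptions of this sketch; the actual behavior is implemented in TS7700 microcode and surfaced through the host messages listed later in this paper.

```python
# Conceptual decision logic for volume ownership during an outage.
# Illustrative only; names and return strings are not a TS7700 interface.

def mount_outcome(owner_reachable, takeover_mode, operation):
    """Decide the outcome of a host operation against a volume whose owner
    is another cluster.

    owner_reachable: True if the owning cluster can still be reached.
    takeover_mode:   None, 'read-only', 'write', or 'service', as set against
                     the failed cluster (manually, by AOTM, or by service prep).
    operation:       'read' or 'write'.
    """
    if owner_reachable:
        # Ownership is transferred transparently between operational clusters.
        return "succeeds (ownership obtained from the owning cluster)"
    if takeover_mode is None:
        # Clusters never take over ownership without being directed.
        return "fails (cannot obtain ownership; CBR4174I expected)"
    if takeover_mode == "read-only" and operation == "write":
        return "fails (volume is restricted to read access under ROT)"
    # 'read-only' with a read, or 'write'/'service' with any operation.
    return "succeeds (ownership taken over from the failed cluster)"

if __name__ == "__main__":
    print(mount_outcome(False, None, "read"))
    print(mount_outcome(False, "read-only", "write"))
    print(mount_outcome(False, "write", "write"))
```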

AOTM enables an ownership takeover mode after a configurable grace period. Jobs can therefore fail intermittently, with the option to retry, until AOTM enables the configured ROT or WOT takeover mode. The grace period is set to 20 minutes by default and can be lowered to a value of 10 minutes. The grace period is in place to allow temporary outages to heal before a takeover mode is enabled; it starts when a TS7700 detects that a remote TS7700 has failed. The following OAM messages can be displayed up until the point when AOTM enables the configured ownership takeover mode:

CBR3758E Library Operations Degraded
CBR3785E Copy operations disabled in library
CBR3786E VTS operations degraded in library
CBR3750I Message from library libname: G0013 Library libname has experienced an unexpected outage with its peer library libname. Library libname may be unavailable or a communication issue may be present.
CBR3750I Message from library libname: G0009 Autonomic ownership takeover manager within library libname has determined that library libname is unavailable. The Read/Write ownership takeover mode has been enabled.
CBR3750I Message from library libname: G0010 Autonomic ownership takeover manager within library libname has determined that library libname is unavailable. The Read-Only ownership takeover mode has been enabled.

A failure of a TS7700 cluster will cause the jobs using its virtual device addresses to abend. In order to re-run the jobs, host connectivity to the virtual device addresses in alternate TS7700 clusters must be enabled (if not already), and an appropriate ownership takeover mode may need to be selected. Scratch allocations can traditionally continue, given they will favor ownership-accessible volumes, but private mounts for read or modification may fail when the volume was owned by the downed cluster. As long as another TS7700 has a valid copy of a logical volume, the jobs which issue private mounts can be retried once an ownership takeover mode is manually or automatically enabled. Once the scratch volumes owned by the remaining clusters are exhausted, WOT must be enabled against the downed cluster in order to utilize the additional scratch volumes which were owned by the downed cluster.
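Because AOTM only enables the configured takeover mode once its grace period expires, failed private mounts can simply be retried until then. The loop below is a rough, host-side illustration of that retry pattern; resubmit_job() is a hypothetical placeholder for site-specific automation, and in practice operators typically just resubmit the failed job.

```python
import time

# Rough host-side illustration of retrying a failed private mount until AOTM's
# grace period has expired and the configured takeover mode is active.
GRACE_PERIOD_MINUTES = 20      # AOTM default; configurable down to 10 minutes
RETRY_INTERVAL_SECONDS = 120

def resubmit_job():
    """Placeholder: resubmit the failed job / reissue the private mount.
    Return True if the mount now succeeds."""
    return False

def retry_until_takeover(grace_minutes=GRACE_PERIOD_MINUTES):
    # Allow a small margin beyond the grace period before escalating.
    deadline = time.time() + (grace_minutes + 5) * 60
    while time.time() < deadline:
        if resubmit_job():
            return True                      # takeover mode is now in effect
        time.sleep(RETRY_INTERVAL_SECONDS)   # wait, then try again
    return False                             # escalate: manual takeover may be needed
```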

The following format is used to document each scenario:

Failover Scenario # - Scenario title
Failure: A description of the link or component failure(s) in this scenario.
Test actions: Actions required to test this scenario.
Effects: A list of the effects of the failure(s) on the TS7700 Grid capabilities and operations.
Host console messages: A list of possible host console messages, with paraphrased text, that may be posted during this scenario.
Recovery actions: Actions required to recover from the failure(s) in this scenario.
Resume actions: Actions required to resume normal operations after a test of this scenario.

Part I: Failover scenarios for 2-Way clusters Grid configuration

Failover Scenario #1 - Failure of a Host Link to a TS7700

Failure: One host link to cluster0 fails. The failure may also be in the intermediate FICON links, FICON directors, FICON channel extenders or remote channel extenders.

Test actions: Run jobs that access volumes in cluster0 only. Disconnect a cable somewhere between the host and cluster0.

Effects: All Grid components continue to operate. All channel activity on the failing host link is stopped. Host channel errors are reported or error information becomes available from the intermediate equipment. If there are alternate paths from the host to either TS7700, host I/O operations may continue. Ownership takeover modes are not needed. All data remains available.

Host console messages:
IOS450E Not operational path taken offline
IOS050I Channel detected error
IOS051I Interface timeout detected

Recovery actions: Normal error recovery procedures and repair apply for the host channel and the intermediate equipment. Contact your service representative for repair of the failed connection.

Resume actions: Reconnect the host cable.

Failover Scenario #2 - Failure of all Host Links to a TS7700

Failure: All host links to cluster0 fail. Although only two links are shown in the scenario diagram, there can be up to four FICON paths per TS7700.

Test actions: Run jobs that access devices in cluster0 only. Disconnect all cables from the host to cluster0. Retry the failed jobs using the virtual device addresses associated with cluster1.

Effects: Virtual tape device addresses for cluster0 become unavailable; all other Grid components continue to operate. All channel activity on the failing host links is stopped. Host channel errors are reported or error information becomes available from the intermediate equipment. Jobs which were using the virtual device addresses of cluster0 will fail. All data remains accessible through the virtual device addresses associated with cluster1. Ownership takeover modes are not needed.

Host console messages:
IOS451E Boxed, No operational paths
IOS050I Channel detected error
IOS000I (and related) Data check/equipment check/I/O error/SIM
IOS002A No paths available
IEF281I Device offline - boxed
IEF524I/IEF525E Pending offline
IEF696I I/O timeout
CBR4195I/CBR4196D (and related) I/O error in library (only for mount commands)
IEC215I (and related) Abend 714-0C - I/O error on close
IEC210I (and related) Abend 214-0C - I/O error on read

Recovery actions: If possible, vary the remote devices in cluster1 online and rerun the failed jobs using the virtual device addresses in cluster1. Normal error recovery procedures and repair apply for the host channels and the intermediate equipment. Contact your service representative for repair of the failed connections.

Resume actions: Reconnect the host cables. Vary cluster0 and its paths and virtual devices online from the host.

Failover Scenario #3 - Failure of One Link Between TS7700s

Failure: One of the Gb Ethernet links between cluster0 and the Grid network fails.

Test actions: Run jobs that access volumes in cluster0 only. Disconnect one of the Gb Ethernet cables between the TS7700s.

Effects: All Grid components continue to operate through the remaining link. All host jobs continue to run. The Grid enters the Grid Links Degraded state and the VTS Operations Degraded state. Copies using the link at the time of the failure are redirected to the remaining link. Performance of copy operations may be reduced. If the TS7700 is operating with a high workload with a copy consistency point of RUN, the Immediate Mode Copy Completions Deferred state may also be entered. Jobs using Synchronous Mode Copy may be slower, given the overall bandwidth to the alternate TS7700 is reduced. Call home support is invoked.

Host console messages:
CBR3786E VTS operations degraded in library
CBR3787E Immediate mode copy operations deferred in library (if RUN copy policy)
CBR3796E Grid links degraded in library
CBR3750I Message from library libname: G0030 Library libname, degraded_port Grid Link is degraded (degraded_port is Pri or Pri2 disconnected)

Recovery actions: Contact your service representative or local network personnel for repair of the failed connections.

Resume actions: Reconnect the Gb Ethernet cable.

Failure of Both Links Between TS7700s w/local Mounts Only Failover Scenario # 4 Failure of both links between TS7700s with local mounts only and no Synchronous Mode Copy Both of the Gb Ethernet links between the TS7700s fails. cluster0 cluster1 Run jobs that access devices in cluster0 only. Disconnect both of the Gb Ethernet cables between cluster0 and the Grid network. X X Jobs on virtual device addresses on cluster0 will continue to run if accessing logical volumes which are owned by cluster0. All scratch mounts to cluster0 will succeed so long as it owns one or more volumes in the scratch category at the time of mount operation. Once the scratch volumes owned by cluster0 are exhausted, scratch mounts will begin to fail. Jobs which access private volumes for read or mod that are owned by cluster1 will fail with a retry request. Ownership takeover is not recommended given cluster1 is still operational. Given this configuration where production runs only to one cluster, ownerships of private volumes are already most likely present within cluster0. All copy operations are stopped. The Grid enters the Grid Links Degraded state and the VTS Operations Degraded state. The Grid enters the Copy Operation Disabled state. If the RUN copy consistency point is being used, the Grid also enters the Immediate Mode Copy Completion s Deferred state. Call home support is invoked. CBR4195I/CBR4196D (and related) I/O error in library CBR3786E VTS operations degraded in library CBR3787E Immediate mode copy operations deferred in library CBR3785E Copy operations disabled in library CBR3796E Grid links degraded in library CBR3750I Message from library libname: G0030 Library libname, Pri, Pri2 Grid Link is degraded CBR3750I Message from library libname: G0013 Library libname has experienced an unexpected outage with its peer library libname. Library libname may be unavailable or a communication issue may be present. Contact your service representative or local network personnel for repair of the failed connections. Reconnect Gb Ethernet cables. Page 17 of 76

Failure of Both Links Between TS7700s w/remote Mounts Failover Scenario # 6 cluster0 X X Failure of both links between TS7700s with remote mounts and no Synchronous Mode Copy Both of the Gb Ethernet links between the TS7700s fails. Create data only on cluster1 using a management class that specifies that cluster0 is not to have a copy. Run specific mount jobs to devices on cluster0 that access the data only present on cluster1. This will result in the TVC associated with cluster1 to be selected for the mount. Disconnect both of the Gb Ethernet cables between the cluster0 and the Grid network. Jobs on virtual device addresses on cluster0 that are using cluster1 as the TVC cluster will fail. Subsequent specific mount jobs that attempt to access the data through cluster0 that only exists on cluster1 will fail. All scratch mounts to cluster0 will succeed so long as it owns one or more volumes in the scratch category at the time of mount operation. Once the scratch volumes owned by cluster0 are exhausted, scratch mounts will begin to fail. Scratch mounts which use the same previously defined management class which only creates content in cluster1 will fail. All copy operations are stopped. The Grid enters the Grid Links Degraded state, the VTS Operations Degraded state and the Grid enters the Copy Operation Disabled state. Call home support is invoked. IOS000I (and related) Data check/equipment check/i/o error/sim CBR4195I/CBR4196D (and related) I/O error in library CBR3786E VTS operations degraded in library CBR3787E Immediate mode copy operations deferred in library CBR3785E Copy operations disabled in library CBR3758E Library Operations Degraded CBR3796E Grid links degraded in library CBR3750I Message from library libname: G0030 Library libname, Pri, Pri2 Grid Link is degraded CBR3750I Message from library libname: G0013 Library libname has experienced an unexpected outage with its peer library libname. Library libname may be unavailable or a communication issue may be present. IEC147I (and related) Abend 613-20 - CBRXLCS processing error Contact your service representative or local network personnel for repair of the failed connections. Reconnect Gb Ethernet cables. cluster1 Page 18 of 76

Failover Scenario #7 - Failure of Both Links Between TS7700s and Ownership Transfer

Failure: Both of the Gb Ethernet links between the TS7700s fail. Autonomic ownership takeover is not enabled for this test.

Test actions: Use virtual device addresses on cluster1 to access or create several specific volumes, so that ownership of those volumes shifts to cluster1 if it is not already there. Disconnect both of the Gb Ethernet cables between cluster0 and the Grid network. Run specific mount jobs which attempt to access one or more of the volumes whose ownership was transferred to cluster1, through the virtual device addresses associated with cluster0.

Note: Do not place the Grid into a write takeover mode when only the links have failed in a real configuration. That could allow a host attached to cluster1 to modify a volume which is also being modified by a host attached to cluster0. AOTM will attempt to prevent manual enablement when this condition is true, but not all network-only conditions can be detected by the solution. Verify that the cluster is in fact down before manually enabling takeover.

Effects: Jobs subsequent to the failure using virtual device addresses on cluster0 that need to access volumes owned by cluster1 fail (even if the data is local to cluster0). Specific mount jobs subsequent to the failure using virtual device addresses on cluster0 that target a volume which is only consistent on cluster1 fail. All scratch mounts to cluster0 succeed as long as it owns one or more volumes in the scratch category at the time of the mount operation and the mount specifies a management class that has a consistency point other than No Copy at cluster0. Once the scratch volumes owned by cluster0 are exhausted, scratch mounts begin to fail. All copy operations are stopped. The Grid enters the Grid Links Degraded state, the VTS Operations Degraded state and the Copy Operation Disabled state. If the RUN copy consistency point is being used, the Grid also enters the Immediate Mode Copy Completions Deferred state. If Synchronous Mode Copy is used, the Grid also enters the Synchronous-Deferred state for the next scratch mount or modification which occurs to a synchronous mode copy defined volume; if the fail-on-sync-failure option is used, these jobs fail. Call home support is invoked. If ownership takeover is enabled against cluster1, operations will continue, but any chance of modification of the same volumes from cluster1 devices introduces risk. If ownership takeover must be enabled, it is recommended to enable only ROT rather than WOT. If WOT is enabled, you must be confident that no host activity to the same volume ranges is occurring within cluster1. If an AOTM setup is configured (enabled or disabled), it will prevent such a manual enablement if it can detect that cluster1 is in fact still running.

Host console messages:
CBR4174I Cannot obtain ownership volume volser in library libname (this message indicates that an operation was attempted that requires volume ownership and volume ownership could not be obtained)
CBR3786E VTS operations degraded in library
CBR3787E Immediate mode copy operations deferred in library
CBR3730E One or more synchronous mode copy operations deferred in library
CBR3785E Copy operations disabled in library
CBR3758E Library Operations Degraded
CBR3796E Grid links degraded in library
CBR3750I Message from library libname: G0030 Library libname, Pri, Pri2 Grid Link is degraded
CBR3750I Message from library libname: G0013 Library libname has experienced an unexpected outage with its peer library libname. Library libname may be unavailable or a communication issue may be present.

Recovery actions: Contact your service representative or local network representative for repair of the failed connections. Do not place cluster0 in an ownership takeover mode unless a unique situation requires it.

Resume actions: Reconnect the Gb Ethernet cables.

Failover Scenario #8 - Failure of One Host Link to the Remote TS7700

Failure: One host link to cluster1 fails. The failure may also be in the intermediate FICON directors, FICON channel extenders or remote channel extenders.

Test actions: Although a host is attached to cluster1, all operations use only the paths to cluster0. Disconnect one of the host links to cluster1.

Effects: No I/O operations are affected. All Grid components continue to operate. Any host LPARs exclusively connected through the failed link will not receive any z/OS console messages initiated by the TS7700 Grid.

Host console messages:
IOS001E Inoperative Path
IOS450E Not operational path taken offline
IOS050I Channel detected error

Recovery actions: Contact your service representative for repair of the failed connections.

Resume actions: Reconnect the host cable.

Failover Scenario #9 - Failure of all Host Links to the Remote TS7700

Failure: All host links to cluster1 fail. Although only two links are shown in the scenario diagram, there can be up to four FICON paths per TS7700.

Test actions: Although a host is attached to cluster1, all operations use only the paths to cluster0. Disconnect all cables from the host to cluster1.

Effects: All Grid components continue to operate. Any host LPARs exclusively connected through the failed links will not receive any z/OS console messages initiated by the TS7700 Grid.

Host console messages:
IOS450E Not operational path taken offline
IOS050I Channel detected error
IOS002A No paths available

Recovery actions: Normal error recovery procedures and repair apply for the host channels and the intermediate equipment. Contact your service representative for repair of the failed connections.

Resume actions: Reconnect the host cables. Vary cluster1 and its paths and virtual devices online from the host.

Failover Scenario #10 - Failure of the Local TS7700

Failure: TS7700-0 (cluster0) fails. Autonomic ownership takeover is not enabled for this test.

Test actions: Power off cluster0 through the management interface, or disconnect the FICON cables from the host to cluster0 and the Grid links between cluster0 and the Grid network. Run specific mount jobs which read volumes that are owned by cluster0, using the virtual device addresses associated with cluster1; these fail because ownership of the volumes cannot be transferred. Enable Read-Only Ownership Takeover mode against cluster0 through the management interface on cluster1. Run the specific mount jobs which read data in volumes owned by cluster0 again; these jobs now run successfully because cluster1 takes over the volumes from cluster0. Run specific mount jobs that attempt to write data to volumes that cluster1 took over; these jobs fail (the IOS000I message indicates write protected) because logical volumes taken over under Read-Only Ownership Takeover mode are restricted to read access only. Enable Write Ownership Takeover mode against cluster0 on cluster1; all jobs now run successfully.

Effects: Virtual tape device addresses for cluster0 become unavailable. All channel activity on the failing host links is stopped. Host channel errors are reported or error information becomes available from the intermediate equipment. Jobs which were using the virtual device addresses of cluster0 fail. Scratch mounts that target volumes owned by the failed cluster also fail until Write Ownership Takeover mode is enabled; this only occurs once all scratch candidates on cluster1 are exhausted, since scratch mounts that target pre-owned volumes succeed. The Grid enters the Copy Operation Disabled and VTS Operations Degraded states. If the RUN copy consistency point is being used, the Grid also enters the Immediate Mode Copy Completions Deferred state. If Synchronous Mode Copy is used, the Grid also enters the Synchronous-Deferred state for the next scratch mount or modification which occurs to a synchronous mode copy defined volume; if the fail-on-sync-failure option is used, these jobs fail. All previously copied data can be made accessible through cluster1 through one of the takeover modes. If a takeover mode for cluster0 is not enabled, data will likely not be accessible through cluster1, even if cluster1 has a valid copy of the data, when the volume is owned by cluster0 (and cluster0 likely owned all volumes); volumes previously owned by cluster1 remain accessible.

Host console messages:
IOS450E Not operational path taken offline
IOS001E/IOS451E Boxed, No operational paths
IOS050I Channel detected error
IOS051I Interface timeout detected
IOS000I (and related) Data check/equipment check/I/O error/SIM/write protected
IOS002A No paths available
IEF281I Device offline - boxed
IOS1000I Write protected
CBR4174I Cannot obtain ownership volume volser in library libname (this message indicates that an operation was attempted that requires volume ownership and volume ownership could not be obtained)
CBR3786E VTS operations degraded in library
CBR3787E Immediate mode copy operations deferred in library
CBR3730E One or more synchronous mode copy operations deferred in library
CBR3785E Copy operations disabled in library
CBR3750I Message from library libname: G0007 A user at library libname has enabled Read/Write takeover against library libname
CBR3750I Message from library libname: G0008 A user at library libname has enabled Read-Only takeover against library libname
CBR3750I Message from library libname: G0013 Library libname has experienced an unexpected outage with its peer library libname. Library libname may be unavailable or a communication issue may be present.
IEC147I (and related) Abend 613-24 - An ATLDS tape volume was opened for output processing and it is file protected

Recovery actions: Enable Write or Read-Only Ownership Takeover mode through the management interface. Write Ownership Takeover mode must be enabled if scratch mounts are failing or if private mounts with modification are required. Rerun the failed jobs using the virtual device addresses associated with cluster1. Normal error recovery procedures and repair apply for the host channels and the intermediate equipment. Contact your service representative for repair of the failed TS7700.

Resume actions: Power on cluster0, or reconnect the host and Gb Ethernet cables. Vary cluster0 and its paths and virtual devices online from the host.
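When planning a test of this scenario, it can help to write down the expected outcome of each step before running it. The table below simply encodes the expectations stated in the scenario description; the dictionary layout is illustrative and not a TS7700 interface.

```python
# Expected outcomes for the scenario 10 test sequence run against cluster1
# devices after cluster0 has failed. Sketch only; layout is illustrative.
EXPECTED = {
    # (takeover mode enabled against cluster0, operation): expected result
    (None,        "read"):  "fails - ownership cannot be obtained (CBR4174I)",
    (None,        "write"): "fails - ownership cannot be obtained (CBR4174I)",
    ("read-only", "read"):  "succeeds - cluster1 takes over the volume",
    ("read-only", "write"): "fails - volume is write protected (IOS000I)",
    ("write",     "read"):  "succeeds",
    ("write",     "write"): "succeeds",
}

if __name__ == "__main__":
    for (mode, op), result in EXPECTED.items():
        print(f"takeover={mode!s:9} op={op:5} -> {result}")
```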

Failover Scenario #11 - Failure of the Remote TS7700

Failure: TS7700-1 (cluster1) fails.

Test actions: Power off cluster1 through the management interface, or disconnect the FICON cables from the host to cluster1 and the Grid links between cluster1 and the Grid network.

Effects: All specific mount jobs continue to run. All scratch mounts to cluster0 succeed as long as it owns one or more volumes in the scratch category at the time of the mount operation. Once the scratch volumes owned by cluster0 are exhausted, scratch mounts begin to fail. All copy operations are stopped. The Grid enters the Copy Operation Disabled and VTS Operations Degraded states. If the RUN copy consistency point is being used, the Grid also enters the Immediate Mode Copy Completions Deferred state. If Synchronous Mode Copy is used, the Grid also enters the Synchronous-Deferred state for the next scratch mount or modification which occurs to a synchronous mode copy defined volume; if the fail-on-sync-failure option is used, these jobs fail. Call home support is invoked.

Host console messages:
CBR3786E VTS operations degraded in library
CBR3787E Immediate mode copy operations deferred in library
CBR3730E One or more synchronous mode copy operations deferred in library
CBR3785E Copy operations disabled in library
CBR3750I Message from library libname: G0013 Library libname has experienced an unexpected outage with its peer library libname. Library libname may be unavailable or a communication issue may be present.

Recovery actions: Contact your service representative for repair of the failed TS7700.

Resume actions: Power on the TS7700, or reconnect the host and Gb Ethernet cables. Vary cluster1 and its paths and virtual devices online from the host.

Failover Scenario #12 - Failure of Both Links Between TS7700s with Autonomic Ownership Takeover

Failure: Both of the Gb Ethernet links between the TS7700s fail. Autonomic ownership takeover is enabled for this test.

Test actions: Use virtual device addresses on cluster1 to access several specific volumes, so that ownership of those volumes shifts to cluster1. Disconnect both of the Gb Ethernet cables between cluster0 and the Grid network. Run specific mount jobs which attempt to access one or more of the volumes whose ownership was transferred to cluster1, through the virtual device addresses associated with cluster0. Note: the results will be the same as for scenario 6, because the TSSCs will determine that cluster0 is still operable and that takeover is not allowed.

Effects: Specific mount jobs subsequent to the failure using virtual device addresses on cluster0 that need to access volumes owned by cluster1 fail (even if the data is local to cluster0). Jobs using virtual device addresses on cluster1 that need to access volumes owned by cluster0 also fail. All scratch mounts to cluster0 succeed as long as it owns one or more volumes in the scratch category at the time of the mount operation. Once the scratch volumes owned by cluster0 are exhausted, scratch mounts begin to fail. All copy operations are stopped. The Grid enters the Grid Links Degraded state, the VTS Operations Degraded state and the Copy Operation Disabled state. If the RUN copy consistency point is being used, the Grid also enters the Immediate Mode Copy Completions Deferred state. If Synchronous Mode Copy is used, the Grid also enters the Synchronous-Deferred state for the next scratch mount or modification which occurs to a synchronous mode copy defined volume; if the fail-on-sync-failure option is used, these jobs fail. Call home support is invoked.

Host console messages:
IOS000I (and related) Data check/equipment check/I/O error/SIM
CBR4174I Cannot obtain ownership volume volser in library libname (this message indicates that an operation was attempted that requires volume ownership and volume ownership could not be obtained)
CBR4195I/CBR4196D (and related) I/O error in library
CBR3786E VTS operations degraded in library
CBR3787E Immediate mode copy operations deferred in library
CBR3730E One or more synchronous mode copy operations deferred in library
CBR3785E Copy operations disabled in library
CBR3758E Library Operations Degraded
CBR3796E Grid links degraded in library
CBR3750I Message from library libname: G0030 Library libname, Pri, Pri2 Grid Link is degraded
CBR3750I Message from library libname: G0013 Library libname has experienced an unexpected outage with its peer library libname. Library libname may be unavailable or a communication issue may be present.
IEC147I (and related) Abend 613-20 - CBRXLCS processing error

Recovery actions: Contact your service representative for repair of the failed connections.

Resume actions: Reconnect the Gb Ethernet cables.

Failover Scenario #13 - Failure of the Local TS7700 with Autonomic Ownership Takeover for Read

Failure: TS7700-0 (cluster0) fails. Autonomic ownership takeover for read is enabled for this test.

Test actions: Power off cluster0 through the management interface, or disconnect the FICON cables from the host to cluster0 and the Grid links between cluster0 and the Grid network. Run specific mount jobs which read data using the virtual device addresses associated with cluster1; these jobs run successfully because ownership of the volumes is automatically taken over by cluster1. Run specific mount jobs that attempt to write data to the volumes that cluster1 took over from cluster0; these jobs fail with an IOS message indicating the volume is write protected, because volumes taken over under read-only takeover mode are restricted to read access only. Manually enable Write Ownership Takeover mode for cluster0; specific mount jobs with writes now succeed.

Effects: Virtual tape device addresses for cluster0 become unavailable. All channel activity on the failing host links is stopped. Host channel errors are reported or error information becomes available from the intermediate equipment. Jobs which were using the virtual device addresses of cluster0 fail. Scratch mounts that target volumes owned by the failed cluster also fail until Write Ownership Takeover mode is enabled; scratch mounts that target pre-owned volumes succeed. The Grid enters the Copy Operation Disabled and VTS Operations Degraded states. If the RUN copy consistency point is being used, the Grid also enters the Immediate Mode Copy Completions Deferred state. If Synchronous Mode Copy is used, the Grid also enters the Synchronous-Deferred state for the next scratch mount or modification which occurs to a synchronous mode copy defined volume; if the fail-on-sync-failure option is used, these jobs fail. All copied data can be read without operator action because an automatic transition to Read-Only Ownership Takeover mode is made. An operator must place cluster0 into Write Ownership Takeover mode to allow volumes owned by cluster0 to be written to.

Host console messages:
IOS450E Not operational path taken offline
IOS001E/IOS451E Boxed, No operational paths
IOS050I Channel detected error
IOS051I Interface timeout detected
IOS000I (and related) Data check/equipment check/I/O error/SIM/write protected
IOS002A No paths available
IEF281I Device offline - boxed
IOS1000I Write protected