MIMIX. Version 7.0 MIMIX Global Operations 5250



Published: September 2010 level

Copyrights, Trademarks, and Notices

Contents

Who this book is for
What is in this book
The MIMIX documentation set

Chapter 1 Clustering Introduction
  What is clustering?
  Clustering terminology

Chapter 2 The MIMIX Global Solution
  Simplified cluster configuration and management
  Cluster management from any system
  Application groups to enhance and simplify switching
  Customized automation scripting for switch processing
  Additional monitoring capability
  Simplified cluster administrative domain configuration
  Support for data protection technologies
  Journaling-based replication
  Switchable independent ASPs
  Geographic mirroring
  Mirrored SAN environments
  Requirements and considerations

Chapter 3 Clustering Overview
  Components of the IBM clustering framework
  Cluster resource services
  Support for logical replication
  Support for switchable device resources
  Support for resilient operational environments
  Clustering concepts
  Recovery domain
  Cluster events
  Failover
  Switchover
  Partition
  Rejoin
  System distress messages
  Resilient applications

Chapter 4 Introduction
  Common terms used throughout this document
  Effect of data group sets on controlling logical replication
  Data group set examples

Chapter 5 Status
  Checking application group status
  Resolving problems reported in the Monitors field
  Resolving problems reported in the Notifications field
  Resolving problems reported in Status columns
  Resolving a procedure status problem
  Resolving an *ATTN status for an application group
  Resolving other common status values for an application group
  Status for Work with Node Entries
  Status for Work with Data Resource Group Entries

Chapter 6 Working with status of procedures and steps
  Displaying status of procedures
  Displaying status of the last run of all procedures
  Displaying available status history of procedure runs
  Resolving problems with procedure status
  Responding to a procedure in *MSGW status
  Resolving a *FAILED or *CANCELED procedure status
  Displaying status of steps within a procedure run
  Resolving problems with step status
  Responding to a step with a *MSGW status
  Resolving *CANCEL or *FAILED step statuses
  Changing status of a procedure
  Running a procedure of type *USER
  Canceling a procedure

Chapter 7 Basic Operations
  Starting MIMIX
  Ending MIMIX for non-dedicated backup processing

Chapter 8 Advanced Operations
  Choosing how to end MIMIX for restricted state processing or an IPL
  Ending MIMIX as part of an application outage
  Ending MIMIX as part of an environment outage
  Ending MIMIX as part of an isolated outage
  Verifying the sequence of the recovery domain
  Changing the sequence of backup nodes
  Examples of changing the backup sequence

Chapter 9 Switching
  Switch processing
  Performing a planned switch
  Responding to an unplanned switch
  Monitoring a switch in progress
  Monitoring data groups during a switch
  Check data groups for a specific application group

Chapter 10 System Maintenance
  Installing MIMIX service packs
  Installing service packs for instances in the system database
  Installing service packs for instances in a switchable independent ASP
  MIMIX configuration under clustering
  System definitions
  Transfer definitions
  Data group definitions

Index

Who this book is for

The MIMIX Global Operations book is for administrators and operators in an IBM i clustering environment who either use the basic clustering support provided within MIMIX or who use MIMIX Global to integrate cluster management with MIMIX logical replication or supported hardware-based replication techniques.

What is in this book

The MIMIX Global Operations book describes basic clustering concepts and identifies supported clustering implementations. It provides operational procedures for addressing status issues as well as for normal operations such as starting, ending, or switching replication.

The MIMIX documentation set

The following documents about MIMIX products are available:

Using License Manager
This book describes software requirements, system security, and other planning considerations for installing MIMIX software and software fixes. The preferred way to obtain license keys and install software is by using AutoValidate and the MIMIX Installation Wizard. However, if you cannot use them, this book provides instructions for obtaining licenses and installing software from a 5250 emulator. This book also describes how to use the additional security functions from Vision Solutions which are available for MIMIX products and commands through License Manager. Also, to support compatible previous releases, this book includes requirements and troubleshooting information for MIMIX Availability Manager.

MIMIX Administrator Reference
This book provides detailed conceptual, configuration, and programming information for MIMIX Enterprise and MIMIX Professional. It includes checklists for setting up several common configurations, information for planning what to replicate, and detailed advanced configuration topics for custom needs. It also identifies what information can be returned in outfiles if used in automation.

MIMIX Global Operations
This book provides high level concepts and operational procedures for MIMIX Global users in an IBM i cluster environment. This book focuses on addressing problems reported in status and basic operational procedures such as starting, ending, and switching.

MIMIX Operations
This book provides high level concepts and operational procedures for managing your high availability environment using MIMIX Enterprise or MIMIX Professional from a 5250 emulator. This book focuses on tasks typically performed by an operator, such as checking status, starting or stopping replication, performing audits, and basic problem resolution.

Using MIMIX Monitor
This book describes how to use the MIMIX Monitor user and programming interfaces available with MIMIX Enterprise or MIMIX Professional. This book also includes programming information about MIMIX Model Switch Framework and support for hardware switching.

Using MIMIX Promoter
This book describes how to use MIMIX commands for copying and reorganizing active files. MIMIX Promoter is available with MIMIX Enterprise only.

MIMIX for IBM WebSphere MQ
This book identifies requirements for the MIMIX for MQ feature which supports replication in IBM WebSphere MQ environments. This book describes how to configure MIMIX for this environment and how to perform the initial synchronization and initial startup. Once configured and started, all other operations are performed as described in the MIMIX Operations book.

CHAPTER 1 Clustering Introduction

IBM System i clustering integrates the functionality of reliable hardware, high availability software, and applications software into a robust, highly available, and resilient computing environment. MIMIX Global adds significant value to clustering environments by providing cluster management that focuses on applications, integrates multiple data protection technologies, and provides enhanced switching capabilities.

This chapter defines clustering and its commonly used terminology. The MIMIX Global Solution (Chapter 2) provides details of the value added by MIMIX Global. Clustering Overview (Chapter 3) provides background information and a conceptual overview of System i clustering.

What is clustering?

Clustering is architected to facilitate continuous availability. Whether you experience a system outage, a site loss, or need planned downtime for system maintenance, access to the functions provided on a clustered system can be switched over to one or more other systems that contain a current copy of the critical application data.

The fundamental concept of clustering is that of resilient resources -- data, processes, applications, and devices that can be recovered if the system on which they exist fails. With clustering, resources are located on more than one system. If the system which is the primary access point for a particular set of resilient resources should fail, a pre-defined backup system becomes the primary access point.

System i clustering is designed so that each of the necessary components can work together to provide continuous availability for data and applications.
The primary components of clustering, shown in Figure 1, are:

- Cluster framework - functions and programming interfaces provided by IBM i and additional licensed programs
- Resilient applications - software that is made resilient through exit programs which enable it to be controlled in a clustering environment, created with assistance from a high availability business partner (HABP)
- Resilient data - middleware software for a variety of high availability technologies and the exit programs which enable it to be controlled in a clustering environment, provided by Vision Solutions or IBM
- Cluster management - tools and user interfaces provided by IBM and by Vision Solutions


Figure 1. Components of a System i cluster solution

Clustering terminology

The following list introduces IBM terminology used for clustering.

- A cluster is a collection of interconnected complete computers that work together as a single unified computing resource. The cluster is made up of one or more cluster nodes and is identified by a name of 10 or fewer characters.

- A cluster node is any system or logical partition (LPAR) that is a member of a cluster. System i clustering supports up to 128 nodes in a cluster, but each node can be defined to only one cluster at a time. The set of nodes that are defined to a cluster is referred to as the cluster membership list.

  Cluster communication uses the TCP/IP protocol to provide communication paths between each node in the cluster. A node must be connected to the cluster using an IP network in order to communicate with other nodes in the cluster. IBM recommends that you establish a dedicated path that is not shared by users or other network traffic.

  For simplicity, the name of a node is often the same name as the host or system name. This name is then mapped to an 8-character cluster node identifier that is associated with one or more Internet Protocol (IP) addresses that represent a system.

- Cluster resources are the resources that are required to be highly available by your business and are available to the nodes within a cluster. Cluster resources can be either moved or replicated to one or more nodes within a cluster. Examples include applications, data libraries, devices, and disk units. Resources are identified in cluster resource groups and controlled through cluster resource group exit programs.

- A cluster resource group (CRG) is an IBM i system object that identifies a collection of cluster resources to be monitored and managed as a single unit. Each CRG defines the relationship between the nodes associated with those resources in a recovery domain. The recovery domain determines the role of each node in the CRG as well as the degree to which each node can participate in events such as synchronizing or performing a recovery action.

  Several types of CRGs are available. Each of the following CRG types is designed for a specific type of cluster resource: application, data, device, and peer.

  Each CRG has a CRG exit program that is called on each active node in the CRG's recovery domain in response to a cluster event. The exit program manages cluster events for the environment established by the CRG. All possible cluster events have a pre-determined response in the exit program code.
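The relationship between nodes, a CRG, its recovery domain, and its exit program can be sketched as a small data model. This is only an illustration in Python, not IBM's or Vision Solutions' implementation: the class names, the event names, the response strings, and the `sample_exit_program` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Node:
    name: str            # cluster node identifier
    active: bool = True  # whether the node is currently active in the cluster

    def __post_init__(self) -> None:
        # The text describes an 8-character cluster node identifier.
        if not 1 <= len(self.name) <= 8:
            raise ValueError("node identifier must be 1 to 8 characters")


def sample_exit_program(node_name: str, event: str) -> str:
    """Hypothetical exit program: every known event has a fixed response."""
    responses = {
        "failover": "take over primary role if next in line",
        "switchover": "exchange roles with current primary",
    }
    return responses.get(event, "no action")


@dataclass
class ClusterResourceGroup:
    name: str
    crg_type: str                             # "application", "data", "device", or "peer"
    recovery_domain: List[Node]               # nodes associated with the CRG's resources
    exit_program: Callable[[str, str], str]   # called as exit_program(node name, event)

    def handle_event(self, event: str) -> Dict[str, str]:
        """Call the exit program once for every active node in the recovery domain."""
        return {
            node.name: self.exit_program(node.name, event)
            for node in self.recovery_domain
            if node.active
        }


crg = ClusterResourceGroup(
    name="PAYROLL",
    crg_type="data",
    recovery_domain=[Node("NODEA"), Node("NODEB", active=False), Node("NODEC")],
    exit_program=sample_exit_program,
)
results = crg.handle_event("failover")
```

In this sketch the inactive node NODEB is skipped, which mirrors the statement that the exit program is called only on active nodes in the recovery domain.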

CHAPTER 2 The MIMIX Global Solution

MIMIX Global leverages the architected relationships within IBM System i clustering and Vision expertise with high availability solutions to provide maximum protection and maximum flexibility to clustered environments.

MIMIX Global offers cluster management that raises the focus of your interaction with a cluster to the level of applications. This adds significant value by enabling the following:

- A single point for monitoring nodes and controlled recovery from failover events
- Support for and coordinated control of multiple data protection technologies, including hybrid environments
- A single button switch that coordinates switching your applications in conjunction with the data protection technologies in use
- Integration of customized scripts to automate actions during switching
- Automation that detects problems that would affect your recovery time objectives (RTOs) before you need to switch
- Simplified configuration for the cluster administration domain
- Seamless integration of cluster management with high availability solutions from Vision Solutions

Simplified cluster configuration and management

Unique features of cluster management capabilities provided by MIMIX Global include the ability to manage the cluster from any system, simplified configuration and control in complex environments, application level switching, and automation to detect potential problems before switching.

Cluster management from any system

Simply stated, multi-management support enables MIMIX Global as well as MIMIX replication products to be managed from more than one node in the cluster. Multi-management support also enables other significant benefits.

A key benefit for clustering is that multi-management support ensures that configuration information about data to be replicated is kept synchronized across nodes in the cluster.
Even though a node may not be actively participating in data replication, it can easily assume the role of a backup node or the primary node because its replication configuration remains current.

Multi-management support can simplify the configuration, operation, and maintenance of multi-node data replication environments. For example, many data replication products are designed to function between two nodes, a production and a backup. Without multi-management support, a classic one-to-many network with three nodes would typically require multiple instances of a replication product to provide high availability and disaster recovery for the production system. Each instance would require separate configuration, operation, and maintenance. In the event that the production system is not available, each instance would have to be switched separately and management of the switching operation would require significant manual operations and planning.

Multi-management support can significantly reduce, and may even eliminate, the need for multiple instances of the replication product. Multi-management support enables MIMIX Global to manage complex data replication configurations within a cluster along with all other cluster management operations from a single instance.

Application groups to enhance and simplify switching

MIMIX Global provides the application group construct to enhance the ability to group and control resources in a way that maintains relationships between resources when responding to cluster events. An application group identifies an application, including the application name, release level, its IP takeover address, an exit program to control the various actions, and the data associated with the application. An application group also includes the node entries that define the recovery domain for the application.

The most significant benefit of application groups is that they provide the ability to control switching of all data replication associated with an application in a single switch request. When a cluster event occurs, the application group provides a coordinated response for all of its associated resources. Application groups also integrate and simplify cluster management and data replication management.

Application groups can also be used when there is no application, only data that needs to be resilient.
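The idea of switching all replication for an application in one request can be pictured with a short Python sketch. It is a conceptual illustration only and does not reflect how MIMIX Global is implemented; the `DataResource` and `ApplicationGroup` classes, their fields, and the sample names and address are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataResource:
    name: str
    source_node: str   # node currently replicating from
    target_node: str   # node currently replicating to

    def switch(self) -> None:
        # Reverse the direction of replication for this resource.
        self.source_node, self.target_node = self.target_node, self.source_node


@dataclass
class ApplicationGroup:
    name: str
    takeover_ip: str
    resources: List[DataResource] = field(default_factory=list)

    def switch(self) -> List[str]:
        """Switch every associated resource in one request; return their names."""
        for resource in self.resources:
            resource.switch()
        return [resource.name for resource in self.resources]


group = ApplicationGroup(
    name="ORDERS",
    takeover_ip="192.0.2.10",  # documentation-only example address
    resources=[
        DataResource("ORDERDATA", "NODEA", "NODEB"),
        DataResource("ORDERIFS", "NODEA", "NODEB"),
    ],
)
switched = group.switch()
```

One call to `group.switch()` reverses every resource together, which is the contrast with switching each replication instance separately.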
Customized automation scripting for switch processing

MIMIX Global has the ability to include customized scripts within the exit programs called by application groups to automate processes, enabling faster switch times. Certified consultants can create customized scripts for switch processing that automate a sequence of actions to be performed by application or data exit programs. Error handling for the result of each action can be included in the script. Exit programs shipped with MIMIX Global use this scripting capability to control starting, stopping, and switching replication from application groups.

Additional monitoring capability

MIMIX Global provides the following additional monitoring capabilities:

- Potential independent ASP overflow conditions - MIMIX Global provides improved capability for detecting potential independent ASP overflow conditions that put your high availability solution at risk due to insufficient storage. If an independent ASP overflows, data may be lost and applications may no longer function.
- Consolidated application attention status - An application group status monitor is created when a MIMIX Global application group is created and notifies you of any conditions that require attention (*ATTN) for the application group.
- Conditions that cause prolonged switching when hardware is used as the HA technology - A switch delay monitor notifies you when conditions exist that may require action to prevent a prolonged switch that could jeopardize recovery time objectives (RTO).
- New objects not in the cluster administrative domain - The cluster administrative domain monitor periodically checks for new objects to be added to the IBM cluster administrative domain. The monitor adds a monitored resource entry to the administrative domain for each object that it finds.

Simplified cluster administrative domain configuration

MIMIX Global provides the ability to use generics when configuring the cluster administrative domain. MIMIX Global also includes the ability to automatically remove monitored resource entries for deleted objects that have been flagged as failed.

Support for data protection technologies

IBM i clustering support is not aware of whether data is resilient. Instead, clustering provides interfaces through which other software products that provide data protection and resiliency can integrate with clustering support. Data resiliency is established through a data CRG by the synchronization and replication of data and objects from the primary node to a backup node in the recovery domain.

The IBM System i supports the following data protection technologies for use with clustering:

- Journaling
- Switchable independent ASPs
- Mirrored independent ASPs and mirrored external storage (SAN-based) functions

MIMIX Global fully supports all of these technologies through exit programs and provides coordination of cluster activities in hybrid environments which include combinations of data technologies. In addition, MIMIX Global seamlessly integrates cluster management with high availability solutions from Vision Solutions.
Journaling-based replication

Journal-based replication occurs at the object level. The operating system tracks changes to objects in user journals and the system journal. High availability software products from Vision Solutions read the journals in order to replicate the changes and apply them to a backup system.


Journal-based data replication can take advantage of robust functionality and enhancements, including remote journaling, user journal support for non-database object types, and minimized journal data.

Figure 2. Logical replication using remote journaling

MIMIX Global provides data CRG exit programs which interface with Vision's high availability software products for data replication. MIMIX Global integrates its cluster management capabilities with journal-based replication solutions to ensure that the backup node has all data and other information necessary to run critical production applications and jobs following a failover or switchover.

Data on backup systems can be made available for other activities, such as queries and saves. Because data is ready at the journal level, switches are fast. MIMIX Global simplifies control of broadcast-to-many environments and is not limited by distance for geographic dispersion of systems.

Switchable independent ASPs

When data replication is performed by switching independent auxiliary storage pools (ASPs), a single copy of the data exists. A device CRG identifies a collection of disks that can be varied on and off independently of the system ASP and basic ASPs (SYSBAS) in the event of a failover or switchover. One scenario using switchable independent ASPs is to switch a disk pool between LPARs on the same physical system (Figure 3).

Figure 3. LPAR implementation of switchable disk pool

In environments which implement switchable independent ASPs, MIMIX Global provides device CRG exit programs and manages switching. MIMIX Global also provides additional monitoring for and notification of conditions that may require action in order to prevent a prolonged switch. Switchable independent ASP support combined with logical replication software from Vision Solutions provides enhanced high availability and disaster recovery.

Geographic mirroring

Geographic mirroring is an independent ASP solution in which the IBM i operating system replicates data from independent ASPs at the memory page level (Figure 4). The independent ASP copy on the backup node is available for use after a detach operation stops its participation in the replication process. Geographic mirroring supports integrated environments (AIX, Linux, Windows). Synchronous operation with a 10 mile maximum distance is recommended. Because a failure could result in data loss for heavily used objects, journaling is also recommended.

In environments which implement geographic mirroring, MIMIX Global provides device CRG exit programs and manages switching. MIMIX Global also provides additional monitoring for and notification of conditions that may require action in order to prevent a prolonged switch. Geographic mirroring support combined with logical replication software from Vision Solutions provides enhanced high availability and disaster recovery.


Figure 4. Geographic mirroring

Mirrored SAN environments

Storage area network (SAN) environments perform replication at the disk sector level between two storage servers. Peer to Peer Remote Copy (PPRC) provides a realtime copy on the backup storage server.

Note: A basic SAN environment without the use of PPRC mirroring software is not a cluster-enabled environment. Basic SAN is a disaster recovery solution that does not require independent ASPs. If it is implemented within a cluster, an abnormal IPL is required on a failover.

Metro mirroring (Figure 5) and global mirroring (Figure 6) are SAN implementations which integrate IBM TotalStorage PPRC functions with independent ASPs and clustering. Both solutions require the 5761-HAS, iHASM licensed program as well as licensed programs for Metro or Global PPRC software.

For metro mirroring, synchronous replication with a 300 kilometer maximum distance is recommended.

Figure 5. Metro mirroring example

Global mirroring supports asynchronous replication over unlimited distances.

Figure 6. Global mirroring example

In both solutions, journaling is recommended to aid in recovery.

MIMIX Global supports all storage solutions. When combined with logical replication software from Vision Solutions, MIMIX Global provides disaster recovery for data that is not within an independent ASP as well as additional monitoring for and notification of conditions that may require action in order to prevent a prolonged switch.

Requirements and considerations

The following requirements must be met for clustering:

- Clustering requires that the Internet Daemon (INETD) server is running on all nodes at all times. Consider the following:
  - The Internet Daemon uses ports 5550 and
  - The Internet Daemon should be configured to automatically start with TCP/IP on all cluster nodes. This should be done through iSeries Navigator during cluster configuration.
  - The Internet Daemon server will not start on any node where the QUSER user profile has *ALLOBJ special authority.
- The value of the network attribute ALWADDCLU (Allow Add Cluster) must be set to *ANY on all nodes that will be added to the cluster configuration.
- The system value QMLTTHDACN (Multithreaded job action) cannot be set to a value of 3 on any node in the cluster.
- Any address to be used for IP takeover must be available for use on every switchable node within the cluster. Reserve the IP addresses but do not create them in advance. The operating system will create them during cluster configuration. Once the addresses are created, they should not be configured to start automatically.
- The user profile used to run CRG exit programs must exist on all nodes in the recovery domain for the CRG and must have *IOSYSCFG special authority.
- MIMIX Global requires a valid access code. If you do not have a valid *MIMIXCLU code, contact your Vision Solutions Sales Representative.
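The per-node requirements above lend themselves to a simple checklist. The following Python sketch assumes the relevant settings have already been collected into a dictionary by some other means; it does not query an IBM i system, and the dictionary keys other than the ALWADDCLU and QMLTTHDACN names are hypothetical.

```python
from typing import Any, Dict, List


def check_cluster_requirements(node: Dict[str, Any]) -> List[str]:
    """Return a list of requirement problems for one node (empty if none found)."""
    problems: List[str] = []
    if not node.get("inetd_running", False):
        problems.append("INETD server is not running")
    if node.get("quser_has_allobj", False):
        problems.append("QUSER has *ALLOBJ; INETD server will not start")
    if node.get("ALWADDCLU") != "*ANY":
        problems.append("network attribute ALWADDCLU must be *ANY")
    if str(node.get("QMLTTHDACN")) == "3":
        problems.append("system value QMLTTHDACN cannot be 3")
    # Special authorities held by the profile that runs CRG exit programs.
    if "*IOSYSCFG" not in node.get("exit_profile_authorities", []):
        problems.append("exit program user profile lacks *IOSYSCFG")
    return problems


good_node = {
    "inetd_running": True,
    "quser_has_allobj": False,
    "ALWADDCLU": "*ANY",
    "QMLTTHDACN": 2,
    "exit_profile_authorities": ["*IOSYSCFG", "*JOBCTL"],
}
bad_node = {
    "inetd_running": True,
    "quser_has_allobj": True,
    "ALWADDCLU": "*NONE",
    "QMLTTHDACN": 3,
    "exit_profile_authorities": [],
}
good_problems = check_cluster_requirements(good_node)
bad_problems = check_cluster_requirements(bad_node)
```

The sketch covers only the settings that can be expressed as a per-node value. Requirements such as reserving IP takeover addresses and holding a valid access code still need to be verified separately.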

CHAPTER 3 Clustering Overview

This chapter provides background information and a conceptual overview of System i clustering.

Components of the IBM clustering framework

The IBM i operating system provides the underlying architecture, functions, and services necessary for clustering. Since its debut, clustering support has expanded to provide additional support for switchable independent ASPs, operational environments, and storage area network (SAN) technologies through separately priced options and licensed programs.

Table 1. IBM software support for clustering

Product: 5722-SS1 or 5761-SS1, IBM i - Base
Provides:
- Core clustering functions via cluster resource services
- Enables management of maintenance switchover and failure recovery at the application level
- Node monitoring and system failure detection

Product: 5722-SS1 or 5761-SS1, option 41 - HA Switchable Resources
Provides:
- Support for switchable independent ASPs and using Management Central GUI in iSeries Navigator

Product: 5761-HAS, IBM System i High Availability Solutions Manager (HASM). Available on IBM i 6.1 or higher
Provides:
- Support for mirrored storage-based data replication technologies such as metro and global mirroring (see note 1)
- Most CL commands for cluster control
- GUI interfaces within IBM Systems Director Navigator for i5/OS for Cluster Resource Services and HASM

Note 1: Additional IBM software is required for implementations which include metro or global mirroring.

Cluster resource services

IBM Cluster Resource Services is part of the base operating system. Cluster resource services provides the integrated services and application programming interfaces (APIs) necessary to create and manage a cluster. This includes:

- Heartbeat monitoring - Heartbeat monitoring ensures that each node (system) in the cluster is active. At regular intervals, each active node in the cluster conveys that it is active by sending a signal to its adjacent nodes. Each node expects an acknowledgment to the heartbeat it sent out as well as an incoming heartbeat from the adjacent node. If a node misses sending a heartbeat for a predetermined number of consecutive heartbeats, a heartbeat failure is signaled. Cluster resource services determines what event to initiate after considering the role of the failing node and whether the failure can be confirmed by a distress message. If the failure cannot be confirmed, cluster resource services will partition the cluster.
- Reliable messaging - The reliable messaging function keeps track of all nodes within a cluster and ensures that all nodes have consistent information about the state of cluster resources. Any status change for a node is broadcast along with a reason code. Retry and timeout values determine how many times a message can be sent to a node before signaling a failure or partition event. More time is allowed on remote networks.
- Switchover administration - Cluster resource services maintains the hierarchy of each node when a switchover or failover occurs. The hierarchy, called the recovery domain, determines which node assumes the role of the primary node.
- Distributed activities - Distributed activities provide the synchronization of actions across the nodes, or a subset of nodes, in a cluster to ensure that all of the nodes affected by the action are involved and that results are consistently reflected across the cluster.
- Parallel jobs - A set of parallel jobs is used to control the cluster and the resources defined to it, perform user and exit program requests, and interact with subsystems for highly available applications.
- APIs - Application programming interfaces (APIs) provide the ability to create clusters, add or remove nodes, and create and manage the system objects which identify groups of cluster resources.
- IP address takeover - The IP address takeover function allows access to an application or device without regard to the system on which the application is running or to where the device is varied on. A floating IP address is switched from the primary node to a backup node without requiring the re-configuration of clients. IP takeover is a key component in providing application resiliency and device resiliency.
- Resiliency support - Application, data, and device resiliency depend upon cluster resource group (*CRG) system objects. Cluster resource services provides the ability for users and programs to allocate resources to and manage these objects.

The characteristics of messaging and heartbeat monitoring can be adjusted to match the performance of the network.

Support for logical replication

System i clustering does not support logical replication directly and is not aware of whether data is resilient or not. Middleware software from high availability business partners (HABPs) can be used in a clustering environment to provide data resiliency. Clustering uses data CRGs as the means to interact with software products that perform logical replication using journaling techniques supported by the IBM i operating system.
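The heartbeat monitoring behavior described under Cluster resource services can be sketched as a counter of consecutive missed heartbeats followed by a confirmation step. This Python sketch is illustrative only: the threshold of 3, the class and function names, and the returned strings are assumptions, and the real services also weigh the role of the failing node.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    max_missed: int = 3   # assumed threshold; the actual value is predetermined by the system
    missed: int = 0       # consecutive heartbeats missed so far

    def record(self, heartbeat_received: bool) -> bool:
        """Record one heartbeat interval; return True when a heartbeat failure is signaled."""
        if heartbeat_received:
            self.missed = 0          # any received heartbeat resets the count
        else:
            self.missed += 1
        return self.missed >= self.max_missed


def resolve_heartbeat_failure(distress_message_received: bool) -> str:
    """A failure confirmed by a distress message is a node failure; otherwise partition."""
    return "confirmed node failure" if distress_message_received else "partition"


monitor = HeartbeatMonitor(max_missed=3)
signals = [monitor.record(received) for received in (True, False, False, False)]
outcome = resolve_heartbeat_failure(distress_message_received=False)
```

Resetting the counter on every received heartbeat is what makes the threshold apply to consecutive misses rather than to a running total.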

Support for switchable device resources

IBM i option 41 (HA Switchable Resources) provides the ability to manage the information necessary to switch access to independent auxiliary storage pools (independent ASPs) from one node to another through device domains. Independent ASPs are those numbered from 33 to 256. IBM i option 41 must be installed with a valid license key before this support can be used with a device CRG.

A device domain is a subset of nodes in a cluster that share device resources, or the logical resources associated with the devices, and which can participate in a switching action. All nodes in a device domain need information about the included resources so that no conflicts occur when the devices are switched. For example, for a collection of switched disks, the independent disk pool identification, disk unit assignments, and virtual address assignments must be unique across the entire device domain.

A cluster node can belong to only one device domain. A node must be defined as a member of a device domain before it can be added to the recovery domain for a device CRG. All nodes in a recovery domain for a device CRG must be in the same device domain. Nodes can be added to and removed from device domains as needed.

Figure 7 shows an example of a device domain within a cluster. A device CRG identifies nodes A and B as the domain which can share an independent disk pool. Node C is not part of the device domain. The devices in the disk pool are accessible (varied on) on node A. When a switchover occurs, the devices are varied off on node A, then made available (varied on) on node B.

Figure 7. Device domain example

Support for resilient operational environments

Clustering supports the use of a cluster administrative domain to maintain a consistent operational environment across nodes in a cluster. Applications often require specific system settings or other environmental conditions collectively known as an operational environment.
This may include configuration

parameters or data, user profiles, job descriptions, as well as system values, network attributes, system environment variables, and subsystem descriptions. Within a high availability environment, the operational environment must be the same on every node where an application can run or store its data.

The resources identified to a cluster administrative domain, called monitored resource entries (MREs), are also identified in an associated peer CRG. The cluster administrative domain monitors the resources for changes and synchronizes any changes across the active domain. Once the domain is created, normal CRG functions are used to manage it. Each node can be defined in only one cluster administrative domain within the cluster.

Figure 8. Cluster administration domain example

Figure 8 shows an example of a four-node cluster with a cluster administrative domain. Each node is an LPAR, with two LPARs in each system. LPAR 1 is the normal production system. The node roles shown are those of the peer CRG associated with the domain. Node roles for application CRGs and data CRGs used in the normal production environment are not shown.

Clustering concepts

This topic describes significant constructs which clustering uses to identify and control cluster resources. These constructs and the concepts associated with them are fundamental to any clustering discussion and appear in user interfaces.

Recovery domain

Each CRG defines its own recovery domain. The recovery domain identifies the current role and preferred role of each node within the CRG. While the current role of a node may change, the preferred role is identified when the CRG is created. The recovery domain also determines the order in which nodes can become the primary access point in the event of an outage. Nodes can have the following roles:

Primary - The node is the primary access point for the resources associated with the CRG. Only one primary node is allowed. For an application CRG, this is the node where the application is currently running. For a data CRG, this node contains the principal copy of the resources. For a device CRG, this node is the current owner of the devices in the CRG. The primary role is not supported for a peer CRG.

Backup - A backup node will take over the role of the primary access point for resources associated with the CRG in the event of an outage on the primary node. A backup node either contains a copy of the resources that is kept current by replication software, or contains the IP takeover address and activation instructions necessary to access the resources. The recovery domain determines the sequence in which the exit program will attempt to activate backup nodes during a switchover or failover. This role is not supported for peer CRGs.

Replicate - A replicate node contains a copy of the resources associated with the CRG, but cannot participate in a switchover or failover. Replicate nodes are optional. They are reserved for those systems which are either not powerful enough to host applications, are used for queries and reports only, or are perhaps used as a data warehouse server. For peer CRGs, nodes defined as replicate represent an inactive access point.

While any CRG type can be defined and managed through MIMIX Global, data CRGs are associated with replication activities.

Cluster events

Clustering identifies over twenty cluster events that affect the ability of a node to participate in a cluster. On each node, cluster resource services monitors for and detects these events, to which the CRG exit programs on all nodes respond. The current and preferred roles of a node determine how cluster resource services and CRG exit programs on each node respond to the event.
The following events have a significant effect on maintaining availability: failover, switchover, and partition.

Failover

A failover occurs when cluster resource services responds to a failure of a primary node by switching the access point for cluster resources normally accessed from that node to the first available backup node. For each CRG in which the failing node is the primary node, the access point to the CRG resources is switched to a backup node according to the recovery domain. If the first backup node is not available, the next backup node in the recovery domain is used.

When multiple CRGs are involved in a failover, device CRGs are processed first, data CRGs second, and application CRGs last.

If the cause of the failover is resolved, the failover can be cancelled. The CRG message queue provides the mechanism to cancel a failover.
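
The failover rules above (first available backup in recovery domain order; device CRGs, then data CRGs, then application CRGs) can be illustrated with a short Python sketch. It is only a model of the ordering logic under simplifying assumptions, not how cluster resource services is implemented. The `Crg` class, the `plan_failover` function, and the CRG names are hypothetical and introduced for illustration.

```python
from dataclasses import dataclass

# Processing order of CRG types during a failover, as described in the text.
CRG_ORDER = {"device": 0, "data": 1, "application": 2}

@dataclass
class Crg:
    """Hypothetical, simplified stand-in for a cluster resource group."""
    name: str
    crg_type: str   # "device", "data", or "application"
    primary: str    # current primary node
    backups: list   # backup nodes in recovery domain order

def plan_failover(crgs, failed_node, available):
    """Return (CRG name, new primary) pairs in processing order.

    Only CRGs whose primary is the failed node are affected. The new
    primary is the first backup in the recovery domain that is available,
    or None when no backup can take over.
    """
    affected = [c for c in crgs if c.primary == failed_node]
    affected.sort(key=lambda c: CRG_ORDER[c.crg_type])
    plan = []
    for crg in affected:
        new_primary = next((n for n in crg.backups if n in available), None)
        plan.append((crg.name, new_primary))
    return plan

crgs = [
    Crg("APP1", "application", "A", ["B", "C"]),
    Crg("DTA1", "data", "A", ["B", "C"]),
    Crg("DEV1", "device", "A", ["B", "C"]),
    Crg("DTA2", "data", "B", ["A"]),
]

# Node A fails and the first backup (B) is also unavailable, so C is used.
plan = plan_failover(crgs, failed_node="A", available={"C"})
print(plan)
```

In this example DTA2 is not part of the plan because node A is only a backup in its recovery domain.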

Switchover

A switchover differs from a failover in how the request is initiated. A switchover is a user request, via a program or the cluster manager interface, to switch the primary access point for resources in a CRG from the specified node to a backup node. The backup node to use is determined by the current recovery domain. Switchovers are typically requested in order to perform system maintenance, such as applying program temporary fixes (PTFs), installing a new release, upgrading the system, or to test the switching process. The relationships between CRGs must be considered when specifying the order in which to switch over multiple CRGs.

Figure 9. Simple switchover example in a two-node cluster, illustrating IP takeover for applications

Partition

A partition occurs when a node loses contact with other nodes and cluster resource services cannot confirm that the node failed. When a partitioned state is in effect, cluster resource services restricts some of the actions that can be performed by CRG exit programs within the partition. Partitions are typically caused by communications problems or by a system failure that was not confirmed by a distress message.

Rejoin

Rejoin is the term used for the process of a node becoming an active member of a cluster after having been a non-participating member. A rejoin ensures that CRGs (specifically, the *CRG object) are identical on all active recovery domain nodes.

System distress messages

On each node, the operating system will broadcast a distress message when it detects that the system is about to fail or is being shut down. Cluster resource services will respond to a distress message broadcast on the cluster communications network. Examples of when a node would initiate a distress message include, but are not limited to, ending all subsystems or those in which cluster jobs or exit programs reside, when a delayed power down is requested from the Hardware Management Console (HMC), or when a UPS battery is drained while operating on backup power.

If the failing node is a primary node, cluster resource services will initiate a failover. If the node is not the primary node, the node is ended and no longer participates in the cluster until user action is taken.

Resilient applications

Resilient applications are those which have the ability to be automatically ended on one node, switched to another node, and started without requiring manual reconfiguration by users. Application resiliency is based on IP address takeover and depends on the use of an application CRG.

The application must be able to recognize the loss of the Internet Protocol (IP) connection between the client and the server. During a switchover, the client application must be aware that the IP connection will be temporarily unavailable and must retry access rather than ending. During a failover, the application must recognize that the IP connection is not available and respond to the error condition by ending normally. Application resilience enables better utilization of data resilience.

To be considered resilient, an application must also provide the following:

- An application CRG exit program that handles cluster events.
- Automated data areas that identify information necessary to set up a resilient environment for the application and its associated data.
- One or more object specifier files (OSFs) that identify the objects associated with the application that must be made resilient.
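
The client-side behavior expected during a switchover (retry access rather than ending) can be sketched in Python as follows. This is a generic illustration, not part of MIMIX or IBM i. The function `call_with_retry`, its parameters, and the `fake_connect` stand-in are hypothetical names used only for this example.

```python
import time

def call_with_retry(connect, attempts=5, delay_seconds=0.0):
    """Call connect() until it succeeds or the attempts are used up.

    A ConnectionError is treated as "the takeover IP address is not yet
    active on the backup node", so the client waits and tries again
    instead of ending.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return connect()
        except ConnectionError as err:
            last_error = err
            time.sleep(delay_seconds)
    # Every attempt failed: surface the last error to the caller.
    raise last_error

# Hypothetical stand-in for a client connection that fails twice while
# the switchover is in progress and then succeeds.
calls = {"count": 0}

def fake_connect():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("takeover address not active yet")
    return "connected"

result = call_with_retry(fake_connect)
print(result, calls["count"])
```

A real client would choose the number of attempts and the delay to match the expected duration of a switchover in its environment.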

CHAPTER 4 Introduction

This document has two main purposes:

1. To document basic operational guidelines and procedures
2. To provide detailed documentation specific to using MIMIX Global

Where applicable, this document includes detailed operational, audit, and switching procedures for your availability solution. These procedures are the best practice recommendations that result from customer feedback to Certified MIMIX Consultants on the Vision Solutions services team.

Common terms used throughout this document

The following terms used in this document are defined as follows:

backup node - A backup node will take over the role of the primary access point for resources associated with the cluster resource group (CRG) in the event of an outage on the primary node. A backup node contains a copy of the resources that is kept current by replication software. The recovery domain determines the sequence in which the exit program will attempt to activate backup nodes during a switchover or failover. The role of backup node is not supported for peer CRGs.

cluster - A cluster is a collection of interconnected complete computers that work together as a single unified computing resource. The cluster is made up of one or more cluster nodes and is identified by a name of 10 or fewer characters.

cluster resources - Cluster resources are the resources that are required to be highly available by your business and are available to the nodes within a cluster. Cluster resources can be either moved or replicated to one or more nodes within a cluster. Examples include applications, data libraries, devices, and disk units. Resources are identified in cluster resource groups and controlled through cluster resource group exit programs.

cluster resource group (CRG) - A cluster resource group is an IBM i system object that identifies a collection of cluster resources to be monitored and managed as a single unit.
Each CRG defines the relationship between the nodes associated with those resources in a recovery domain that determines the role of each node in the CRG as well as the degree to which each can participate in events such as synchronizing or performing a recovery action. Several types of CRGs are available. Each of the following CRG types is designed for a specific type of cluster resource: application, data, device, and peer.

CRG exit program - Each CRG has a CRG exit program that is called on each active node in the CRG's recovery domain in response to a cluster event. The exit program manages cluster events for the environment established by the CRG. All possible cluster events have a pre-determined response in the exit program code.

Broadcast replication - A broadcast replication configuration consists of three or

more nodes where a single source node feeds two or more target nodes. For example, in a three-node broadcast replication, system A is the source node to both system B and system C.

Cascade replication - A cascade replication configuration consists of three or more nodes in series. For example, a three-node cascade replication starts with system A as the source node for system B. System B is the source node for system C.

Data group set - A data group set is the total number of data groups needed to enable replication between all nodes in a cluster. The first part of the three-part name of each data group in the set is the same. Not all of the data groups in the set will be active at the same time.

Node - A node refers to one of two or more logical system definitions that make up a valid replication instance. For non-LPAR systems, a node represents the entire system footprint. For LPAR systems, a node represents one of the LPAR partitions.

Peer node - A node identified as peer has no order within the recovery domain. The peer role is only supported by peer CRGs. The access point to the resources in the peer CRG is controlled by the cluster management application.

Primary node - This node is the primary access point for the resources associated with a CRG. Only one primary node is allowed. For an application CRG, this is the node where the application is currently running. For a data CRG, this node is the source of data for the resources to be replicated. For a device CRG, this node is the current owner of the devices in the CRG. The role of primary node is not supported for a peer CRG.

Recovery domain - A recovery domain identifies the current role of each node within the CRG. The recovery domain also determines the order in which nodes can become the primary access point in the event of an outage. Each CRG defines its own recovery domain.
Replicate node - A replicate node contains a copy of the resources associated with the CRG, but cannot participate in a switchover or failover. Replicate nodes are optional. They are reserved for those systems which are either not powerful enough to host applications, are used for queries and reports only, or are perhaps used as a data warehouse server. For peer CRGs, nodes defined as replicate represent an inactive access point.

Replication instance - A replication instance refers to a group of nodes that make up your replication environment.

Simple replication - A simple replication configuration consists of two nodes, a source node (primary) and a target node (backup).

Effect of data group sets on controlling logical replication

By definition, a data group is a MIMIX construct used to control the logical replication of data between two nodes (systems). Clustering environments usually involve more

than two nodes. In a clustering environment with three or more nodes, multiple data groups must be configured to ensure that data can flow between any nodes in the cluster. The total number of data groups needed to enable replication between all nodes in a cluster is known as the data group set. In clusters with three or more nodes, at least one data group within the data group set is disabled at any given time. The data groups associated with the current primary node are enabled and data groups associated with only backup nodes are disabled to ensure that data from the primary node can be replicated to only the expected nodes. Disabled data groups are associated with backup nodes.

Only one user journal can be identified as the source of replication for a data group. Replicating from a second journal requires a second data group. Similarly, in a clustering environment, each source user journal is associated with a data group in a data group set. Replication from multiple source journals on a node requires multiple data group sets.

When starting or ending logical replication in a clustering environment, it may be necessary to invoke more than one command request to ensure that all of the selected processes for all data groups on a node have been addressed. This is because the three-part name of a data group definition only identifies systems, not system roles within the data group or node roles within the cluster. MIMIX can determine the role of a specified system within a data group but it cannot determine whether what you specify will select all processes for all data groups on a specific node.
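
The relationship between a data group set and the current primary node, described at the start of this topic, can be illustrated with a short Python sketch. It assumes one data group per pair of nodes and models a data group as a plain tuple of its three-part name. The function names `data_group_set` and `split_by_primary` are hypothetical and are not MIMIX commands or APIs.

```python
from itertools import combinations

def data_group_set(name, nodes):
    """Build one data group per pair of nodes (an assumption of this sketch).

    Each data group is modeled as a tuple (name, system1, system2).
    """
    return [(name, a, b) for a, b in combinations(nodes, 2)]

def split_by_primary(dg_set, primary):
    """Separate data groups that include the primary node from those that do not.

    Data groups between backup nodes only are the ones that are disabled.
    """
    enabled = [dg for dg in dg_set if primary in dg[1:]]
    disabled = [dg for dg in dg_set if primary not in dg[1:]]
    return enabled, disabled

dg1 = data_group_set("DG1", ["A", "B", "C"])
enabled, disabled = split_by_primary(dg1, primary="A")
print("enabled: ", enabled)
print("disabled:", disabled)
```

With three nodes and node A as primary, the single data group between nodes B and C is the one that is disabled.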
You can either determine the source system of each data group that includes the node you want and tailor your command requests accordingly, or you can adopt the practice of always invoking two requests which specify the data group definition as follows:

First request: DGDFN (*ALL *ALL node)
Second request: DGDFN (*ALL node *ALL)

Data group set examples

The following examples illustrate how a data group set affects procedures for data group activity.

Data group set example - Table 2 shows a three-node cluster that is configured to replicate from two unique source journals. Two sets of data groups are required, one for each journal. Each data group set contains the necessary data groups to allow any node in the cluster to become the primary node. Table 2 illustrates these concepts:

- Since only one node can be the primary node, data groups defined between nodes which do not include the primary node must be disabled. In this example, one data group in the set must be disabled at any given time.
- The primary node cannot be determined from just the three-part name of a data group definition. The three-part name only identifies systems, not system roles

within the data group or node roles within the cluster.

Table 2. Example of a data group set.

Cluster Nodes    Data Group Sets

                 Name    System1    System2
                 DG1     A          B
                 DG1     A          C
                 DG1     B          C

                 Name    System1    System2
                 DG2     B          A
                 DG2     C          A
                 DG2     C          B

Working with target-side only replication processes example - When resolving data group issues, at times it may be appropriate to start or end only the processes which run on the affected node. The key concept to remember is to consider the entire node, not just a data group.

While MIMIX commands for starting and ending data groups (STRDG and ENDDG) permit specifying only source or target processes, these commands are not node-aware; that is, they cannot ensure that the specified processes will be acted upon on all data groups on a specific node. Within a data group, the STRDG or ENDDG command can determine replication roles of each system (source or target) by evaluating the data group's data source parameter (DTASRC), but the commands are not aware of the cluster node roles (primary, backup, or replicate). Therefore, you may need to request the STRDG or ENDDG command multiple times to achieve the expected results.

Consider a cluster which has the data group set identified in Table 2. For this example, node A is the primary node and you want to take action on replication processes on node B. Data groups DG1 B C and DG2 C B are disabled.

Table 3 illustrates that it is better to issue multiple command requests than to not have all the appropriate processes which run on a node selected. Each row shows a variation for specifying the data group definition (DGDFN) on a request scoped to only target processes PRC(*ALLTGT). When only one command is used, neither row is sufficient to ensure that all target processes on node B are ended. The Result column illustrates the variations in results due to which system is considered source by the data group.
It also illustrates that the cluster role of primary node does not necessarily correlate to the data group role of source system. At times, such as following a switch, the primary node role and data group source role are not the same system. Using two commands corresponding to the two rows in Table 3 will ensure that all of the appropriate processes are selected for action.
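
The reason two requests are needed can be seen by modeling how a three-part data group definition matches by position. The Python sketch below is a simplified model of name matching only. It does not represent how STRDG or ENDDG select processes, and the function names `matches` and `select_for_node` are hypothetical. The data groups are the enabled ones from Table 2 when node A is the primary node.

```python
def matches(pattern, dgdfn):
    """True when every part of the pattern is *ALL or equals the same part of dgdfn."""
    return all(p == "*ALL" or p == v for p, v in zip(pattern, dgdfn))

def select_for_node(data_groups, node):
    """Model the two requests: DGDFN (*ALL *ALL node) and DGDFN (*ALL node *ALL)."""
    first = [dg for dg in data_groups if matches(("*ALL", "*ALL", node), dg)]
    second = [dg for dg in data_groups if matches(("*ALL", node, "*ALL"), dg)]
    return first, second

# Enabled data groups from Table 2 with node A as primary
# (DG1 B C and DG2 C B are disabled).
ENABLED = [
    ("DG1", "A", "B"),
    ("DG1", "A", "C"),
    ("DG2", "B", "A"),
    ("DG2", "C", "A"),
]

first, second = select_for_node(ENABLED, "B")
print("first request selects: ", first)
print("second request selects:", second)
```

Each request selects only the data groups in which node B appears in one position of the name, so both requests are needed to reach every enabled data group that involves node B.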
