HPE NonStop X Cluster Solution Manual
Part Number: 828081-003b
Published: April 2017
Edition: L15.08 and subsequent L-series RVUs
Contents

About This Document ... 5
    Supported Release Version Updates (RVUs) ... 5
    Intended Audience ... 5
    New and Changed Information for the 828081-003 Edition ... 5
    Publishing History ... 5
NonStop X Cluster Solution (NSXCS) Overview ... 6
    Hardware Requirements for NonStop X Cluster InfiniBand (IB) Solution ... 8
    Technical Document for NonStop X NS7 Systems ... 10
    Software Overview for NonStop X Cluster Solution ... 11
        OSM Package ... 11
        Expand and Message System Traffic ... 11
        SCF Manageability for NonStop X Cluster Solution ... 12
        NonStop Kernel Message System ... 14
        Software Requirements for NonStop X Cluster Solution ... 14
Installation Tasks ... 16
    Related Documentation ... 16
    Prerequisites ... 16
    Service Provider Required Checklist for Nodes and Connections ... 18
SCF Commands ... 19
    Cluster Connectivity Subsystem SCF Commands ... 19
    Summary of SCF Kernel Subsystem Commands to Manage CCMON and MSGMON Processes ... 19
    DCTMON Subsystem SCF Commands ... 20
    Summary of SCF Kernel Subsystem Commands to Manage the DCTMON Processes ... 21
Checking Operations ... 22
    Checking the Operation of the NonStop X Cluster Solution ... 22
    Using OSM to Check Operations ... 22
    Checking the External Fabric for All Nodes ... 22
    Checking the Operation of Each Node ... 22
        Checking That Automatic Line-Handler Generation is Enabled ... 23
        Checking CCMON, MSGMON, and DCTMON ... 23
        Checking the Cluster Connectivity Subsystem ... 24
        Checking the NonStop X Cluster Solution Configuration ... 24
        Checking the Internal X and Y Fabrics ... 25
    Checking the Operation of the Expand Processes and Lines ... 26
        Checking $NCP, $ZEXP and $ZZWAN ... 26
        Checking the Status of Expand-Over-IB Line-Handler Processes ... 26
        Checking the Status of An Expand-Over-IB Line ... 27
Changing the NonStop X Cluster Solution Topology ... 28
    Using OSM to Add a Node to a NonStop X Cluster Solution ... 28
    Prerequisites for InfiniBand Cable Installation ... 29
    Performing the OSM Adding a Node to a NonStop X Cluster Guided Procedure ... 29
    Performing an RVU Upgrade on a Node ... 29
    Removing a Node From a NonStop X Cluster Solution ... 30
    Replacing IB Cluster Switches in a NonStop X Cluster Topology ... 31
Starting and Stopping Cluster Connectivity Processes and Subsystems ... 32
    Stopping the Cluster Connectivity Subsystem ... 32
        Using SCF ... 32
        Using the OSM Service Connection ... 32
    Stopping Expand-Over-IB Lines ... 32
    Starting the Cluster Connectivity Subsystem ... 33
    Starting the External IB Fabric ... 33
    Starting MSGMON ... 34
    Starting CCMON ... 34
    Starting DCTMON ... 34
    Starting the Expand-Over-IB Line-Handler Processes and Lines ... 34
Websites ... 35
Support and other resources ... 36
    Accessing Hewlett Packard Enterprise Support ... 36
    Accessing updates ... 36
    Customer self repair ... 37
    Remote support ... 37
    Warranty information ... 37
    Regulatory information ... 38
    Documentation feedback ... 38
Troubleshooting ... 39
    Using OSM to Suppress NonStop X Cluster Solution Alarms ... 39
    Restoring Connectivity to a Node ... 39
    Switching the CCMON Primary and Backup Processes ... 40
    Starting Required Processes and Subsystems ... 40
    Accessing SCF Help for Error Messages ... 40
    Fallback Procedures ... 41
        Fallback to a Previous RVU ... 41
        Fallback to Previous SPRs ... 42
    Collecting Required After Image (AI) Information for Diagnosing Problems ... 42
        Required AI to Troubleshoot Cluster Connectivity Issues ... 42
        Required AI to Troubleshoot Perceived Performance Problems ... 43
        Required AI to Troubleshoot Process Abends ... 43
        Required AI to Troubleshoot Processor Halts ... 43
    Cables ... 44
Configuring MSGMON, CCMON, and DCTMON ... 45
    ZCMNCONF TACL Macro ... 45
        Run the ZCMNCONF Macro ... 46
    ZDCTCONF TACL Macro ... 46
        Run the ZDCTCONF Macro ... 46
    Alternatively Creating an SCF Command File ... 47
        Configure MSGMON ... 47
        Configure CCMON ... 48
        Configure DCTMON ... 48
        Start CCMON, MSGMON, and DCTMON ... 49
    Expand Lines Form ... 50
About This Document

This guide describes the HPE NonStop X Cluster Solution and provides planning, installation, operations, and troubleshooting information.

Supported Release Version Updates (RVUs)
This publication supports the L15.08 RVU and all subsequent L-series RVUs until otherwise indicated in a replacement publication.

Intended Audience
This guide is written for those responsible for planning, installing, and operating a NonStop X Cluster Solution. All installation and replacement procedures related to a cluster must be performed by authorized service providers who have completed Hewlett Packard Enterprise training courses.

New and Changed Information for the 828081-003 Edition
This edition contains new information about RoCE clustering for HPE Virtualized NonStop.

Publishing History

Part Number   Product Version   Publication Date
828081-001    N.A.              August 2015
828081-002    N.A.              June 2016
828081-003    N.A.              March 2017
NonStop X Cluster Solution (NSXCS) Overview

The NonStop X Cluster Solution provides clustering to:

• HPE Integrity NonStop X NS7 systems (zones) in interconnected zone topologies. These interconnected zones use InfiniBand (IB) to pass information between nodes functioning as one large processing entity and topology. For optimal fault tolerance, each node is connected along each of two IB fabrics (X and Y).
• Virtualized NonStop in interconnected zone topologies. These interconnected zones use RDMA over Converged Ethernet (RoCE) for fabric connectivity of data traffic between Virtualized NonStops and Virtualized CLIMs. For optimal fault tolerance, each RoCE cluster is connected along each of two RoCE fabrics (X and Y).

Table 1: Features of NonStop X System IB Clusters

Supported system type: NS7
Supported RVU: L15.08 and later for NS7 X1 and long-haul topology; L16.05 and later for NS7 X2
Cluster Connectivity subsystem for cluster monitoring and operations: Managed and monitored by the Cluster Connectivity Monitor (CCMON) process
Expand-over-IB clustering option: Expand-over-IB coexists with Expand-over-IP
Supported zone configurations: 1 zone (referred to as single-zone) or 2 or 3 zones (referred to as multi-zone); 3 zones is the maximum
Maximum total nodes for zone configurations: 24 nodes maximum; the 24 nodes can be in 1, 2, or 3 zones
NonStop X IB cluster switches per zone (a zone is comprised of node(s) and an HPE NonStop X IB cluster switch pair): Two are required for each zone; one switch for each IB fabric to ensure fault tolerance
Maximum distance between a node and an IB cluster switch: 30 meters
Maximum distance between IB cluster switches without IB link extenders: 30 meters
Maximum distance between NS7 nodes in same zone: 60 meters (short-haul intrazone)
Maximum distance between NS7 nodes in different zones: 90 meters (short-haul interzone)

Table Continued
Maximum distance between NonStop X IB cluster switches with IB link extenders and DWDMs: 65 kilometers (long-haul topology)
Provides parallel implementation of frequently used $ZNUP functionality: Via the Parallel Remote Destination Control Table Monitor (DCTMON) process

Table 2: Features of Virtualized NonStop RoCE Clusters

Supported system: Virtualized NonStop
Supported RVUs for RoCE clustering: L17.02 and later
Cluster Connectivity subsystem for cluster monitoring and operations: Managed and monitored by the Cluster Connectivity Monitor (CCMON) process
Expand-over-IB clustering option: Expand-over-IB coexists with Expand-over-IP
Supported zone configurations: Consult your HPE representative
Maximum total zones for zone configurations: Consult your HPE representative
Maximum distance between a node and an RoCE switch: 100 meters (at a 40 Gbps data rate)
Maximum distance between zones: Consult your HPE representative
Long haul support: Not supported for L17.02
Provides parallel implementation of frequently used $ZNUP functionality: Via the Parallel Remote Destination Control Table Monitor (DCTMON) process

NOTE: For information about clustering the HPE Integrity NonStop NS7 CG X2 system, contact your HPE representative. That topology is not documented in this manual.
CAUTION: All procedures in this manual must be performed by authorized service providers.

Hardware Requirements for NonStop X Cluster InfiniBand (IB) Solution

HPE NonStop X IB Cluster Switch
The NonStop X IB cluster switch is an InfiniBand FDR 36-port managed switch. A pair of NonStop X IB cluster switches is required and forms the foundation of each zone in the topology. Each of these NonStop X IB cluster switches connects the nodes within a zone along one fabric. Every node connects to the NonStop X Cluster Solution through a pair of 4X FDR InfiniBand links (one link per fabric), giving each node up to 56 Gbps bidirectional bandwidth per fabric. The connections are through copper cables or active optical cables with QSFP transceivers.

NS7 System Internal Blade IB Switch
Ports 1-8 and ports 9-16 in the internal Blade IB switch are used for Cluster I/O Module (CLIM) direct connections. Port 17 is reserved for NonStop Application Direct Interface (NSADI). Port 18 is used for IB clustering with NonStop X IB cluster switches.

More information
• Technical Document for NonStop X NS7 Systems on page 10
• Replacing a NonStop Blade IB Switch, NonStop IO Expansion IB Switch, or NonStop IB Cluster Switch in a NonStop X System (service providers only)
• NonStop X NS7 Planning Guide

IB Link Extenders for Long-Haul Topology
The NonStop X Cluster long-haul topology requires customer-supplied, third-party IB link extenders. Link extenders implement the appropriate buffering and credit extension logic for long-haul links. An example topology is shown below. For information about link extenders, refer to the third-party documentation that comes with the product.
Dense Wavelength Division Multiplexers (DWDM) for Long-Haul Topology
If one or more of these conditions apply, the NonStop X Cluster long-haul topology requires third-party DWDMs or optical multiplexers:
• Distance between cluster zones is greater than what is supported by the link extender WAN optical transceivers.
• Access to long-haul is through DWDM channels leased from a telecommunications provider.
• Desire to use optical multiplexing to lower costs (for example, to share long-haul links with other data traffic such as Ethernet or Fibre Channel).
For information about the DWDMs, refer to the third-party documentation that comes with the product. An example topology is shown below.
Hardware Requirements for RoCE Clustering for Virtualized NonStop
The requirements for RoCE clustering are:
• RoCE clustering on Virtualized NonStop requires the use of 40 Gbps RDMA over Converged Ethernet (RoCE) switches.
• For L17.02, switches used for RoCE clustering do not support a long-haul topology.

Technical Document for NonStop X NS7 Systems
Each new NS7 system includes a detailed Technical Document that provides information about the nodes in the NonStop X Cluster Solution topology such as:
• Rack included with the system and each enclosure installed in the rack
• Rack U location at the bottom edge of each enclosure
• Each cable with source, destination, connector type, cable part number, and connection labels
TIP: It is important to retain all NS7 system records in an Installation Document Packet, including the Technical Document for your system and any configuration forms. To add CLIM configuration forms to the packet, ask your service provider to copy and complete the forms in the Cluster I/O Module (CLIM) Installation and Configuration Manual (L15.02+).

Technical Document for Virtualized NonStop RoCE Cluster Switches
NOTE: There is no technical document for Virtualized NonStop system hardware, including RoCE switches.

Software Overview for NonStop X Cluster Solution
• OSM Package on page 11
• Expand and Message System Traffic on page 11
• SCF Manageability for NonStop X Cluster Solution on page 12
• Cluster Connectivity Subsystem on page 12
• NonStop Kernel Message System on page 14
• Software Requirements for NonStop X Cluster Solution on page 14

OSM Package
The HPE Open System Management (OSM) product is the primary manageability tool for the NonStop X Cluster Solution. OSM works together with the Cluster Connectivity subsystem and the Cluster Endpoint Database (CEPDB) to manage the topology. Once your service provider has verified your NonStop X Cluster Solution topology using Prerequisites on page 16, the service provider will log onto the OSM Service Connection and use the Adding a Node to NonStop X Cluster guided procedure and its online help to create your configuration.

Expand and Message System Traffic
In a NonStop X Cluster Solution, Message System traffic flows directly between processors by way of the Message System but under the control of Expand. As of the L15.08 RVU, the Expand product supports an option called Expand-over-IB for NS7 systems participating in a NonStop X Cluster Solution topology. Beginning with the L17.02 RVU, the Expand-over-IB line handler applies to both NonStop X IB clusters and Virtualized NonStop RoCE clusters. This option provides high-performance inter-node communication with minimal processor cost and message latencies.
This illustration diagrams the Message System traffic. Note that for L17.02 and later, Expand-over-IP is needed to connect NonStop X InfiniBand clusters with Virtualized NonStop RoCE clusters.
Secure Message System traffic between processes on different nodes travels through the Expand-over-IB line handlers, and through the local message system between the communicating processes and the line handlers. Nonsecure Message System traffic flows directly between processors through the intersystem Message System connections as directed by the Network Routing Table (NRT) under the control of Expand. The intersystem Message System connections are not used if the appropriate settings are not made in the NRT.

SCF Manageability for NonStop X Cluster Solution
You can use the Subsystem Control Facility (SCF) to configure, control, and display information about configured objects within each subsystem. Each subsystem responds to and processes SCF commands that affect that subsystem. For more information about the SCF commands, refer to SCF Commands on page 19.
The NonStop X Cluster Solution uses the new Cluster Connectivity Subsystem on page 12 and also introduces the DCTMON subsystem on the L-series RVUs. Both are configured and controlled using SCF commands. Note, however, that OSM is the primary manageability tool for the NonStop X Cluster Solution.

Cluster Connectivity Subsystem
The Cluster Connectivity subsystem manages and monitors NonStop X Cluster Solution operations on each node. Nodes join and leave the NonStop X Cluster when you start and stop Cluster Connectivity services on the nodes. Each node monitors its own connectivity with other nodes in the NonStop X Cluster. The identifiers for the subsystem are:

Cluster Connectivity Subsystem Identifiers
CMN: The mnemonic name for the subsystem; appears in event identifiers such as TANDEM.CMN.L01.
ZCMN: Appears at the beginning of EMS event token names such as ZCMN-EVT-NODISCOVERY.

Table Continued
$ZZCMN: Process name for the CCMON process pair (see CCMON following). $ZZCMN is the Cluster Connectivity subsystem monitor process.
$ZZKRN.#ZZCMN: Generic process name of the Cluster Connectivity subsystem monitor process as configured under $ZZKRN.

CCMON, MSGMON, and DCTMON Descriptions
This table describes the NonStop X Cluster Solution processes supported on L15.08 and later. If these processes are started on L15.02, they generate a version mismatch EMS message and gracefully terminate. For information on configuring and starting these processes, refer to Configuring MSGMON, CCMON, and DCTMON on page 45.

CCMON
CCMON is a persistent process pair and is the focal point for NonStop X Cluster Solution management and monitoring.
Function:
• Subsystem Programmatic Interface (SPI) server for subsystem management commands.
• Monitors and responds to cluster-related events.
• Exists on every node in a NonStop X cluster.
• Can be replaced without requiring a system load.
• If it fails on a system with two processors or more, a new CCMON process pair is immediately started by the Persistence Manager process ($ZPM).
• Supports PTRACE for analysis of internal CCMON and MSGMON traces. MSGMON is bundled with CCMON.
CCMON must be configured and started before adding a node to the NonStop X Cluster Solution.

MSGMON
The Message System Monitor (MSGMON) process is a monitor process that resides in each processor and executes various cluster-related functions required by the Message System.
Function:
• Helper for the CCMON subsystem monitor
• Handles communications between the primary CCMON and individual processors
• Can be replaced without requiring a system load
• Runs with a process name of $ZIMnn in each CPU, where nn is the CPU number
MSGMON must be configured and started before adding a node to the NonStop X Cluster Solution.

Table Continued
DCTMON (1)
The Parallel Remote Destination Control Table Monitor process (DCTMON) improves the performance of remote DCT access operations. These operations typically precede remote file open operations.
Function:
• Implements a subset of the remote DCT access operations supported by the Network Utility Process ($ZNUP).
• Provides a parallel implementation of frequently used $ZNUP functionality.
• Handles incoming remote DCT access requests sent directly via the NonStop X Cluster Solution without involving Expand-over-IB line handlers.
• Runs with a process name of $ZDMnn in each CPU, where nn is the CPU number.

(1) Unlike CCMON and MSGMON, DCTMON processes are not required to establish connectivity among nodes in a NonStop X Cluster Solution. DCTMON processes implement an optimization that improves the performance of certain cluster operations such as remote file opens.

NonStop Kernel Message System
The Message System provides the same set of privileged Application Programming Interfaces (APIs) for inter-processor communication between processes residing on the same or on a different node. The Message System implements a common communication protocol stack to exchange messages between processors in the same node or between different nodes connected via the NonStop X Cluster Solution.

Software Requirements for NonStop X Cluster Solution
The software products below are required and must have these minimum release levels.

Table 3: Software for NonStop X Cluster Solution (L15.08 and later L-Series RVUs)

Software Component                Software   Required Minimum
Expand-over-IB profile            T0958      T0958 L02
CCMON/MSGMON                      T0942      T0942 L01
DCTMON                            T1000      T1000 L02
OSM Service Connection Suite      T0682      T0682 L02
InfiniBand (IB) Cluster License   T0953      T0953 L01 (1)

(1) T0953L01 is required for L15.08 and L16.05 only. For L17.02 and later L-series RVUs, cluster licensing is consolidated in the core license and T0953L02 is required. Refer to the NonStop Core Licensing Guide for details.
Table 4: Software for RoCE Cluster on Virtualized NonStop (L17.02 and later RVUs only)

Software Component             Software   Required Minimum
Expand-over-IB profile         T0958      T0958 L02
CCMON/MSGMON                   T0942      T0942 L01^AAB
DCTMON                         T1000      T1000 L02
OSM Service Connection Suite   T0682      T0682 L02^BAI

NOTE: For information about NonStop core licensing for NonStop X systems and Virtualized NonStop systems, see the latest edition of the NonStop Core Licensing Guide.

In addition to the products above:
• The NonStop X Cluster Solution must be enabled in the core license.
• T0953 with a minimum version of T0953L02 must be installed for Virtualized NonStop and RoCE clusters.
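Installed product versions can be confirmed from SCF before proceeding. The sketch below is illustrative only: it assumes the Cluster Connectivity subsystem is already running as $ZZCMN (the VERSION command is among the supported Cluster Connectivity SCF commands listed later in this manual), and \NODE is a placeholder node name; the exact display layout is not reproduced here.

```
\NODE.$SYSTEM.SYSTEM> SCF VERSION SUBSYS $ZZCMN, DETAIL
\NODE.$SYSTEM.SYSTEM> SCF VERSION PROCESS $ZZCMN, DETAIL
```

Compare the reported versions against the required minimums in the tables above.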
Installation Tasks

CAUTION: Only HPE authorized service providers can perform the procedures described in this manual.

Related Documentation
• OSM Adding a Node to a NonStop X Cluster Guided Procedure online help, or the PDF version of this online help available on Service Access Workbench (SAW)
• NonStop X NS7 Planning Guide
• Virtualized NonStop Deployment and Configuration Guide

Prerequisites
Service providers: before adding the first node or subsequent nodes to a NonStop X Cluster Solution zone, you must verify that these prerequisites are met.

Service providers must verify that all nodes that are being added meet these prerequisites:
• In L17.02 and later L-series RVUs: NonStop X Cluster Solution must be enabled in the core license. In lieu of clustering enablement in the core license, each node may use the T0953 license described below. If clustering is not enabled in the core license, contact the license manager help desk (license.manager@hpe.com) to request a new license.
• In L15.08 and L16.05, each node requires an installed NonStop X Cluster license file (T0953). The license file is in either of these subvolumes: $SYSTEM.SYSTEM or $SYSTEM.SYSnn. If the NonStop X Cluster license file is not installed, it must be ordered from QMS and copied to either of the above subvolumes.
• Software Requirements for NonStop X Cluster Solution on page 14 must be met. If software installation is required on any nodes, follow all the installation instructions in each SPR's softdoc to install the SPR on the nodes. L15.08 or later is required for InfiniBand clustering. L17.02 is required for Virtualized NonStop RoCE clustering.

Table Continued
Service providers must verify that all nodes that are being added meet these prerequisites (continued):
• The hardware requirements described in Hardware Requirements for NonStop X Cluster InfiniBand (IB) Solution on page 8 or Hardware Requirements for RoCE Clustering on page 11 for Virtualized NonStop must be met.
• All required cables for the customer's cluster configuration must be on hand. If necessary, order the required switches and cables. Service providers: refer to the QMS tech doc for specific details of the customer's configuration, including connections and cable types.

CAUTION: For InfiniBand clusters only, do not connect any InfiniBand cables. To avoid outages, InfiniBand cabling steps must be followed very carefully under the guidance of the OSM Adding a Node to a NonStop X Cluster guided procedure.

• The Expand manager process ($ZEXP) and the network control process ($NCP) must be started in the nodes. Check the status of these processes by using these TACL commands:
  STATUS $ZEXP
  STATUS $NCP
• Ensure that $ZZWAN is running. You can check the $ZZWAN process status by using this TACL command:
  STATUS $ZZWAN

TIP: This prerequisite only applies if you are manually configuring Expand-over-IB lines. Skip it if you prefer to use OSM Automatic Line Handler Configuration.

• Expand-over-IB lines are planned for each node to be added. When planning these lines, make sure that these defaults are appropriate for your installation:
  - The default prefix used to name the line-handler processes
  - The default CPU assignment for each line-handler process

Proceed to Service Provider Required Checklist for Nodes and Connections on page 18.
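The process checks in the prerequisites above can be run together in a single TACL session. A minimal sketch using only the commands listed in the prerequisites (`==` introduces a TACL comment):

```
== Verify that the Expand manager, network control,
== and WAN manager processes are running on this node
STATUS $ZEXP
STATUS $NCP
STATUS $ZZWAN
```

If any of these processes is not running, start it before continuing with the installation tasks.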
Service Provider Required Checklist for Nodes and Connections

CAUTION: For InfiniBand clusters only, the service provider must review important documentation and carefully follow InfiniBand cabling instructions prior to adding nodes to an InfiniBand cluster.

• Review Checking the Operation of Each Node on page 22 before adding nodes and after adding nodes to an InfiniBand cluster.
• To ensure cluster connectivity, service providers need to review the online help within the OSM Adding a Node to a NonStop X Cluster guided procedure, or the PDF version of this online help available on SAW, Adding a Node to a NonStop X Cluster Guided Procedure.

CAUTION: For InfiniBand clusters only, do not connect the InfiniBand fiber-optic cables until the Adding a Node to a NonStop X Cluster guided procedure instructs you to do so. All connections must be made one fabric at a time. Failure to adhere to these cautions could result in serious InfiniBand cluster problems.

• Perform the Adding a Node to a NonStop X Cluster guided procedure and then return to this manual as needed.
• Determine if the automatic line-handler configuration is enabled in the nodes:
  - Check for the presence of $ZZKRN.#OSM-CONFLH-RD. If it is present, the automatic line-handler configuration is enabled.
  - Check the status of the routing distributor ($ZOLHD). If this process is running, automatic line-handler configuration is enabled. Although the routing distributor is running, the #OSM-CONFLH-RD process remains in a STOPPED state.
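The two line-handler checks in the checklist above can be issued as follows. This is a sketch: the SCF command matches the one shown later in "Checking That Automatic Line-Handler Generation is Enabled," while the TACL STATUS form for $ZOLHD is an assumption based on the standard TACL STATUS command.

```
== From SCF: the generic process exists only if automatic
== line-handler configuration is enabled
SCF INFO PROCESS $ZZKRN.#OSM-CONFLH-RD

== From TACL: check whether the routing distributor is running
STATUS $ZOLHD
```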
SCF Commands

This chapter describes the SCF commands for three separate subsystems:
• Cluster Connectivity Subsystem SCF Commands on page 19
• Summary of SCF Kernel Subsystem Commands to Manage CCMON and MSGMON Processes on page 19, for managing the CCMON and MSGMON processes
• DCTMON Subsystem SCF Commands on page 20

Cluster Connectivity Subsystem SCF Commands
You can use SCF to manage the Cluster Connectivity subsystem. The SCF command interface for the Cluster Connectivity subsystem is simpler, with fewer options, than in cluster subsystems found in RVUs prior to L-series RVUs. You can also use the OSM Service Connection to manage the Cluster Connectivity subsystem. For DCTMON subsystem commands, refer to DCTMON Subsystem SCF Commands on page 20.
You must be running the L15.08 RVU or later to access Cluster Connectivity SCF commands. To access the help for Cluster Connectivity SCF commands, type SCF HELP CMN. This table lists the supported commands for the Cluster Connectivity subsystem.

Table 5: Supported SCF Commands for Cluster Connectivity Subsystem
Command \ Object Type NULL PROCESS SUBNET SUBSYS CONN ALTER INFO X X PRIMARY X START X STATUS X X X STOP X TRACE X VERSION X X X X X
X = The command currently supports this object type.

Summary of SCF Kernel Subsystem Commands to Manage CCMON and MSGMON Processes
This table summarizes the commands by the action performed and each command's syntax. Type SCF HELP KERNEL PROCESS for more information on NonStop Kernel subsystem commands.
If you want to:                                                Use this command:
View the configuration attributes of the $ZZCMN process pair   SCF INFO PROCESS $ZZKRN.#ZZCMN, DETAIL
Alter the configuration attributes of the $ZZCMN process pair  SCF ALTER PROCESS $ZZKRN.#ZZCMN
Start the $ZZCMN process pair                                  SCF START PROCESS $ZZKRN.#ZZCMN
Stop the $ZZCMN process pair                                   SCF ABORT PROCESS $ZZKRN.#ZZCMN
View the configuration attributes of the MSGMON process        SCF INFO PROCESS $ZZKRN.#MSGMON, DETAIL
Alter the configuration attributes of the MSGMON process       SCF ALTER PROCESS $ZZKRN.#MSGMON
Start the $ZIMnn MSGMON processes                              SCF START PROCESS $ZZKRN.#MSGMON
Stop the $ZIMnn MSGMON processes                               SCF ABORT PROCESS $ZZKRN.#MSGMON

For information on configuring the $ZZCMN and $ZIMnn processes using the preferred TACL macro method, refer to Configuring MSGMON, CCMON, and DCTMON on page 45.

DCTMON Subsystem SCF Commands
NOTE: For more details on the DCTMON functionality, refer to Support Note S14037: Introduction of Product T1000 DCTMON. DCTMON is also supported on J- and H-series RVUs.
The DCTMON processes ($ZDMnn) must be started before DCTMON SCF commands can be processed. To access the help for DCTMON SCF commands, type SCF HELP DCT. This table lists the supported DCTMON subsystem SCF commands.

Table 6: Supported SCF Commands for DCTMON Subsystem
Command \ Object Type NULL PROCESS STATS TRACE X X VERSION X X
X = The command currently supports this object type.
Summary of SCF Kernel Subsystem Commands to Manage the DCTMON Processes
This table summarizes the commands by the action performed and each command's syntax. Type SCF HELP KERNEL PROCESS for more information on NonStop Kernel subsystem commands.

If you want to:                                            Use this command:
View the configuration attributes of the DCTMON process    SCF INFO PROCESS $ZZKRN.#DCTMON, DETAIL
Alter the configuration attributes of the DCTMON process   SCF ALTER PROCESS $ZZKRN.#DCTMON
Start the $ZDMnn DCTMON processes                          SCF START PROCESS $ZZKRN.#DCTMON
Stop the $ZDMnn DCTMON processes                           SCF ABORT PROCESS $ZZKRN.#DCTMON

For information on configuring the $ZDMnn process using the preferred TACL macro method, refer to Configuring MSGMON, CCMON, and DCTMON on page 45.
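For routine checks, the kernel-subsystem commands from the two summaries above can be collected into a single SCF command file. This is a sketch: the file name CHKCLUS is illustrative, and the commands are exactly those listed in the summaries (`==` introduces a comment).

```
== CHKCLUS: verify configuration and state of the
== CCMON, MSGMON, and DCTMON generic processes
INFO PROCESS $ZZKRN.#ZZCMN, DETAIL
INFO PROCESS $ZZKRN.#MSGMON, DETAIL
INFO PROCESS $ZZKRN.#DCTMON, DETAIL
STATUS PROCESS $ZZKRN.#ZZCMN
STATUS PROCESS $ZZKRN.#MSGMON
STATUS PROCESS $ZZKRN.#DCTMON
```

Such a file can be supplied to SCF as its input file (for example, SCF /IN CHKCLUS/ from TACL).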
Checking Operations

This chapter provides procedures for checking the operations of the NonStop X Cluster Solution in a NonStop X InfiniBand cluster or a Virtualized NonStop RoCE cluster.

Checking the Operation of the NonStop X Cluster Solution
From each node in a NonStop X InfiniBand cluster or Virtualized NonStop RoCE cluster, check the operation of the NonStop X Cluster Solution as soon as you complete any of these procedures:
• Installing a node
• Replacing a NonStop X Cluster IB switch or Virtualized NonStop RoCE switch
• Replacing a switch within a node

Using OSM to Check Operations
You can use the OSM Service Connection and associated online help to quickly check the health of each switch and node in the NonStop X Cluster Solution. If you find problems with a node, resolve the problems before adding additional nodes to the NonStop X Cluster Solution.

Checking the External Fabric for All Nodes
Perform this procedure after:
• Installing a node in an InfiniBand or RoCE cluster
• Replacing a NonStop X Cluster IB switch or Virtualized NonStop RoCE switch
• Replacing a switch within a node

NOTE: You must configure remote passwords before you can use the STATUS SUBNET $ZZCMN, PROBLEMS command to gather information about remote nodes.

Use SCF to check if direct communication is possible on both fabrics between all nodes in the cluster. This command queries the CCMON processes in all nodes and displays whether connectivity along the X and Y fabrics is up for each node. You do not have to issue separate commands for each node.
1. Log on to any node and check for connectivity problems in the cluster:
   STATUS SUBNET $ZZCMN, PROBLEMS
2. For each node that should be active, verify that the display shows "No connectivity problems detected."
3. If the display shows a list of nodes next to a particular node, a problem has occurred with the connection between that node and the nodes in the list.
In that case, issue SCF STATUS SUBNET $ZZCMN in each problem node to get more details about the connectivity issues affecting those nodes. For more details on SCF STATUS SUBNET $ZZCMN command options, type SCF HELP CMN STATUS SUBNET.

Checking the Operation of Each Node
The topics in this section describe operation tasks before and after a node installation:
• Checking That Automatic Line-Handler Generation is Enabled on page 23
• Checking CCMON, MSGMON, and DCTMON on page 23
• Checking the Cluster Connectivity Subsystem on page 24
Checking the NonStop X Cluster Solution Configuration on page 24 Checking the Internal X and Y Fabrics on page 25 Checking the Operation of the Expand Processes and Lines on page 26 Checking the Status of Expand-Over-IB Line-Handler Processes on page 26 Checking the Status of An Expand-Over-IB Line on page 27 Checking That Automatic Line-Handler Generation is Enabled Use one of these methods to determine if the automatic line-handler generation is enabled in the nodes: Check for the presence of $ZZKRN.#OSM-CONFLH-RD by entering SCF INFO PROCESS $ZZKRN.#OSM-CONFLH-RD. If it is present, the automatic line-handler configuration is enabled. Check the status of the routing distributor ($ZOLHD). If this process is running (but in a STOPPED state), automatic line-handler configuration is enabled. Although the routing distributor is running, the #OSM-CONFLH-RD process remains in a STOPPED state. Checking CCMON, MSGMON, and DCTMON NOTE: Unlike CCMON and MSGMON, DCTMON processes are not required to establish connectivity among nodes in a NonStop X Cluster Solution. DCTMON processes implement an optimization that improves the performance of certain cluster operations such as remote file opens. For more information about DCTMON, refer to CCMON, MSGMON, and DCTMON Descriptions on page 13. Perform this procedure: Before adding a node to a NonStop X Cluster Solution If you cannot start the Cluster Connectivity subsystem If you cannot connect to a remote node If you cannot display information about the NonStop X Cluster Solution in the OSM Service Connection To avoid conflicts with OSM and the guided procedures, you must configure $ZZKRN.#ZZCMN, $ZZKRN.#MSGMON, and $ZZKRN.#DCTMON to be the symbolic names for CCMON, MSGMON, and DCTMON, respectively. Use the SCF interface to the Kernel subsystem to check that these processes are configured correctly and in the STARTED state: 1. Display a list of all currently configured generic processes: INFO PROCESS $ZZKRN.* 2. 
Check that the CCMON, MSGMON, and DCTMON processes are displayed in the *Name column with these required SCF symbolic names:
Program File Name    Process Name    Required SCF Symbolic Name
CCMON                $ZZCMN          $ZZKRN.#ZZCMN
MSGMON               $ZIMnn          $ZZKRN.#MSGMON
DCTMON               $ZDMnn          $ZZKRN.#DCTMON

3. If the processes are not configured with the required names, configure each process as described in Configuring MSGMON, CCMON, and DCTMON on page 45.

4. Check that CCMON, MSGMON, and DCTMON are configured correctly:

INFO PROCESS $ZZKRN.#ZZCMN, DETAIL
INFO PROCESS $ZZKRN.#MSGMON, DETAIL
INFO PROCESS $ZZKRN.#DCTMON, DETAIL

Compare the configured modifiers for CCMON, MSGMON, and DCTMON with the recommended modifiers shown in Configuring MSGMON, CCMON, and DCTMON on page 45.

5. Check that the processes are started:

STATUS PROCESS $ZZKRN.#ZZCMN
STATUS PROCESS $ZZKRN.#MSGMON
STATUS PROCESS $ZZKRN.#DCTMON

6. If any of these are not started, refer to Starting and Stopping Cluster Connectivity Processes and Subsystems on page 32.

Checking the Cluster Connectivity Subsystem

At an SCF prompt, type the following command to check if the Cluster Connectivity subsystem is started on the node:

STATUS SUBSYS $ZZCMN

If necessary, issue the SCF START SUBSYS $ZZCMN command to start the Cluster Connectivity subsystem in the node.

Checking the NonStop X Cluster Solution Configuration

The SCF STATUS CONN $ZZCMN command displays the status of the local node's connections to the IB cluster fabrics; it displays the same values whether or not the DETAIL option is specified. At an SCF prompt, type the following command to check the configuration of the NonStop X Cluster Solution:

SCF STATUS CONN $ZZCMN

An example display of this command for a NonStop X system node running the L-series RVUs L15.08 through L16.05 follows:

\TEST.$SYSTEM.SYSTEM> SCF STATUS CONN $ZZCMN
STATUS CONNECTION
IB Cluster                 Configured
Expand Node Number         101
IB Cluster License File
$SYSTEM.SYS01.ZLT953IB

The status fields for the STATUS CONN display are:

IB Cluster
Indicates if the local node is configured for connectivity to a cluster.

Expand Node Number
Identifies the Expand node number of the node.

IB Cluster License File
Lists the location of the T0953 license file on the local node that authorizes clustering. This cluster licensing method has been supported since the L15.08 RVU. However, beginning with L17.02, it is preferred that cluster licensing be provided through core licensing.

An example display of this command for nodes running the L17.02 RVU or later follows:

\TEST.$SYSTEM.SYSTEM> SCF STATUS CONN $ZZCMN
STATUS CONNECTION
Clustering                            Configured
Expand Node Number                    63
Cluster License File                  $SYSTEM.SYS01.ZLT953IB
Cluster Enablement in Core License    Yes

The status fields for the STATUS CONN display are:

Clustering
Indicates if the local node is configured for connectivity to a cluster.

Expand Node Number
Identifies the Expand node number of the node.

Cluster License File
Lists the location of the T0953 license file on the local node that authorizes clustering. This cluster licensing method has been supported since the L15.08 RVU. However, beginning with L17.02, it is preferred that cluster licensing be provided through core licensing.

Cluster Enablement in Core License
Indicates whether clustering is enabled in the core license. This cluster licensing method is first supported starting with the L17.02 RVU.

Checking the Internal X and Y Fabrics

If a problem occurs on an internal fabric on any node in a NonStop X IB cluster, check whether OSM shows any alarms on the internal Blade IB switches and NonStop X system processors of the problem node. To check the fabric's status, you can use the SCF STATUS FABRIC $ZFAB command.
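The STATUS CONN displays shown earlier are fixed-format text, so a monitoring script can turn them into name/value pairs. The sketch below is illustrative only: the sample text mirrors the L17.02 example above, and splitting each row on runs of two or more spaces is an assumption about the display layout, not a documented interface.

```python
# Illustrative parser for the SCF STATUS CONN $ZZCMN display.
# The sample text mirrors the L17.02 example above; splitting rows
# on 2+ spaces is an assumption about the fixed-format layout.
import re

SAMPLE = """\
STATUS CONNECTION
Clustering                            Configured
Expand Node Number                    63
Cluster License File                  $SYSTEM.SYS01.ZLT953IB
Cluster Enablement in Core License    Yes
"""

def parse_status_conn(text):
    """Return the STATUS CONN rows as a {field: value} dict."""
    fields = {}
    for line in text.splitlines():
        parts = re.split(r"\s{2,}", line.strip())
        if len(parts) == 2:          # skips the STATUS CONNECTION banner
            fields[parts[0]] = parts[1]
    return fields

info = parse_status_conn(SAMPLE)
print(info["Clustering"])                          # Configured
print(info["Cluster Enablement in Core License"])  # Yes
```

A script like this could, for example, raise an alert when Clustering is not reported as Configured after a system load.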
Checking the Operation of the Expand Processes and Lines

NOTE: Expand-over-IB line-handler processes and lines are used in both IB and RoCE clusters.

$NCP, $ZEXP, and $ZZWAN must be running before you add a node to a NonStop X Cluster Solution topology. The Expand-over-IB line-handler processes and lines are configured and started after the node is added.

Checking $NCP, $ZEXP and $ZZWAN

Perform this procedure before running the OSM Adding a Node to a NonStop X Cluster guided procedure.

Process                          TACL Name    Required SCF Symbolic Name
Expand manager                   $ZEXP        $ZZKRN.#ZEXP
Network control process          $NCP         Not applicable
WAN subsystem manager process    $ZZWAN       $ZZKRN.#ZZWAN

1. Check that $ZEXP, $NCP, and $ZZWAN are started. At a TACL prompt:

STATUS $ZEXP
STATUS $NCP
STATUS $ZZWAN

2. Check that the $NCP ALGORITHM modifier is set to the same value on all nodes and on all Expand nodes that communicate with the NonStop X Cluster Solution. At an SCF prompt:

INFO PROCESS $NCP, DETAIL

If these processes are not properly configured or not started, refer to the Expand Configuration and Management Manual.

Checking the Status of Expand-Over-IB Line-Handler Processes

Perform this procedure after you have added a node to a NonStop X Cluster Solution or if you cannot start an Expand-over-IB line.

NOTE: Expand-over-IB applies to both NonStop X IB clusters and RoCE clusters.

To check the status of the line-handler processes, either:

At an SCF prompt, type STATUS DEVICE $ZZWAN.*

Perform the Add Node to NonStop X Cluster action to determine if lines are configured and the state of the lines. For more information, see the Adding a Node to a NonStop X Cluster Guided Procedure online help.
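The $NCP ALGORITHM comparison in Checking $NCP, $ZEXP and $ZZWAN above spans every node, so it lends itself to a small consistency check over captured command output. The sketch below is only an illustration under stated assumptions: the node names, the excerpt format of INFO PROCESS $NCP, DETAIL, and the HOPCOUNT value are all hypothetical placeholders, and real output must be collected from each node.

```python
# Illustrative check that the $NCP ALGORITHM modifier matches across
# nodes. The node names, the excerpt layout, and the HOPCOUNT value
# are hypothetical; real "INFO PROCESS $NCP, DETAIL" output must be
# captured from each node in the cluster.
import re

captured_output = {
    "\\NODEA": "... ALGORITHM ........ HOPCOUNT ...",
    "\\NODEB": "... ALGORITHM ........ HOPCOUNT ...",
}

def algorithm_of(detail_text):
    """Pull the ALGORITHM value out of a captured DETAIL excerpt."""
    m = re.search(r"ALGORITHM\s*\.*\s*(\w+)", detail_text)
    return m.group(1) if m else None

values = {node: algorithm_of(text) for node, text in captured_output.items()}
# All nodes must report the same ALGORITHM value.
assert len(set(values.values())) == 1, f"ALGORITHM mismatch: {values}"
```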
Checking the Status of An Expand-Over-IB Line

NOTE: You can also check the status of an Expand-over-IB line using OSM. For details, see the OSM online help.

Perform this procedure after you have added a node or if you are having problems with an Expand-over-IB line.

NOTE: Expand-over-IB applies to both NonStop X IB clusters and RoCE clusters.

1. To list the Expand-over-IB lines:

LISTDEV TYPE 63,7

The naming convention for Expand-over-IB lines is $IB<Expand node number>. For example, $IB035 is the Expand-over-IB line to Expand node number 035.

2. Check that the line is started and ready:

STATUS LINE $IB035, DETAIL

3. Check that the path is started:

STATUS PATH $IB035, DETAIL

4. For additional information about the lines:

INFO PROCESS $NCP, NETMAP

5. If a line is STOPPED, NOT READY, or otherwise not in a STARTED state, issue SCF STATUS SUBNET $ZZCMN to verify that connectivity to the remote node is up, and then start the line. For example:

START LINE $IB035
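The line naming convention described in step 1 can be captured in a small helper when scripting the per-line checks above. This is a sketch, not a documented interface: the three-digit zero padding is inferred from the $IB035 example, and the 0-254 range for Expand node numbers is an assumption worth confirming against your network plan.

```python
# Helper for the Expand-over-IB line naming convention above:
# $IB<Expand node number>. Three-digit zero padding is inferred
# from the $IB035 example; the 0-254 range is an assumption.
def expand_over_ib_line(expand_node_number: int) -> str:
    """Return the conventional line name for an Expand node number."""
    if not 0 <= expand_node_number <= 254:
        raise ValueError("Expand node numbers are assumed to be 0-254")
    return f"$IB{expand_node_number:03d}"

print(expand_over_ib_line(35))   # $IB035
print(expand_over_ib_line(101))  # $IB101
```

With this helper, a checking script can generate the STATUS LINE and STATUS PATH commands of steps 2 and 3 for every remote node listed on the Expand Lines Form.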
Changing the NonStop X Cluster Solution Topology

CAUTION: Only authorized service providers can perform the procedures described in this manual.

Service providers can use these procedures to make changes to an already installed NonStop X Cluster Solution:

Using OSM to Add a Node to a NonStop X Cluster Solution on page 28
Performing an RVU Upgrade on a Node on page 29
Removing a Node From a NonStop X Cluster Solution on page 30
Replacing IB Cluster Switches in a NonStop X Cluster Topology on page 31

Using OSM to Add a Node to a NonStop X Cluster Solution

This table describes the OSM action that is used for adding a node to a NonStop X Cluster.

OSM Action: Add Node to NonStop X Cluster (1)

Function:
Launches the Adding a Node to a NonStop X Cluster guided procedure.
Indicates whether or not line-handler processes are configured between the local node and remote nodes and shows the line states.
Allows you to configure and start Expand-over-IB line-handler processes and Expand-over-IB lines from the local node to all remote nodes.
Indicates when to connect the fiber-optic cables between the nodes and the NonStop X IB Cluster switches.
Prompts you to start Cluster Connectivity cluster services by placing the NonStop X Cluster Solution SUBSYS object in the STARTED state and collects cluster information.
Configures the Cluster Endpoint Database (CEPDB) in each node in the cluster.

(1) IMPORTANT: This OSM action has required Prerequisites for InfiniBand Cable Installation on page 29.
Prerequisites for InfiniBand Cable Installation

Service providers must ensure these prerequisites are met:

All required planning tasks for adding a node are met. Refer to Prerequisites on page 16 and Service Provider Required Checklist for Nodes and Connections on page 18.

Install the InfiniBand cabling one fabric at a time, after the Tech Doc has been carefully reviewed.

CAUTION: Do not connect the InfiniBand fiber-optic cables until the guided procedure instructs you to do so.

You have carefully reviewed the instructions in the Adding a Node to a NonStop X Cluster Guided Procedure online help or the PDF version of this help, Adding a Node to a NonStop X Cluster Guided Procedure, located on SAW. It is also advised that you keep the online help or the PDF open during the procedure.

Once all these prerequisites are met, proceed to Performing the OSM Adding a Node to a NonStop X Cluster Guided Procedure on page 29.

Performing the OSM Adding a Node to a NonStop X Cluster Guided Procedure

1. In the tree pane of the OSM Service Connection, select the System object. Right-click and select Actions.

2. From the available actions drop-down list, select Add Node to NonStop X Cluster.

3. Click Perform Action. Ensure you are using the documentation referred to in Prerequisites for InfiniBand Cable Installation on page 29.

CAUTION: Do not connect the InfiniBand fiber-optic cables until the Adding a Node to a NonStop X Cluster guided procedure instructs you to do so.

4. In the guided procedure, click Start.

Performing an RVU Upgrade on a Node

This procedure describes the steps for an RVU upgrade on a node connected to the NonStop X Cluster Solution.

TIP: Besides the RVU upgrade itself, this procedure:

Minimizes EMS error messages on other nodes connected to the same cluster.

Minimizes error-handling software overhead in other nodes connected to the same cluster, such as message retries to the node receiving the RVU upgrade.
Allows for a graceful communication shutdown to/from the node receiving the RVU upgrade.

To perform an RVU upgrade on a node:
1. Record the Expand-over-IB lines to be stopped. Refer to Expand Lines Form on page 50.

2. Shut down any applications using Expand-over-IB connections between the node and the rest of the cluster.

3. Stop the Expand-over-IB lines. Refer to Stopping Expand-Over-IB Lines on page 32.

4. On the node that will receive the RVU upgrade, stop the Cluster Connectivity subsystem. Refer to Stopping the Cluster Connectivity Subsystem on page 32.

5. Perform the RVU upgrade.

6. After the system load of the new RVU, issue these SCF commands to verify that the $ZZCMN, $ZIMnn, and $ZDMnn (if present) processes are started:

STATUS PROCESS $ZZKRN.#ZZCMN
STATUS PROCESS $ZZKRN.#MSGMON
STATUS PROCESS $ZZKRN.#DCTMON

These processes should start automatically after system load if they were configured as described in Configuring MSGMON, CCMON, and DCTMON on page 45. Otherwise, you can start them manually as described in Starting MSGMON on page 34, Starting CCMON on page 34, and Starting DCTMON on page 34.

7. To confirm that the Cluster Connectivity subsystem is started on the node that received the RVU upgrade, issue the SCF command:

STATUS SUBSYS $ZZCMN

8. The Cluster Connectivity subsystem should start automatically after a system load, assuming the STARTSTATE attribute is configured as STARTED. For details on the STARTSTATE attribute, refer to the SCF Reference Manual for the Kernel Subsystem or type SCF HELP CMN ALTER SUBSYS.

a. If necessary, display the STARTSTATE attribute by issuing the SCF command:

INFO SUBSYS $ZZCMN

b. If necessary, manually start the Cluster Connectivity subsystem by issuing the SCF command:

START SUBSYS $ZZCMN

9. To confirm that connectivity between the node that received the RVU upgrade and all other nodes in the NonStop X Cluster has been restored, issue the SCF command:

STATUS SUBNET $ZZCMN

NOTE: It might take a few minutes for the node that received the RVU upgrade to restore Cluster Connectivity subsystem connectivity to all other nodes.

a.
The STATUS SUBNET $ZZCMN command also shows the state of the Expand-over-IB lines to/from the node receiving the RVU upgrade. The Expand-over-IB lines should restart automatically if the OSM automatic line-handler generation feature is enabled as described in Checking That Automatic Line-Handler Generation is Enabled on page 23.

b. If necessary, you can restart the previously stopped Expand-over-IB lines. Refer to Starting the Expand-Over-IB Line-Handler Processes and Lines on page 34.

Removing a Node From a NonStop X Cluster Solution

To remove a node from a NonStop X Cluster Solution in a NonStop X IB cluster or Virtualized NonStop RoCE cluster:
1. Record the Expand-over-IB lines to be stopped. Refer to Expand Lines Form on page 50.

2. Shut down any applications using Expand-over-IB connections between the node and the rest of the cluster.

3. From the OSM Service Connection, perform the Place Local Node in Service action on the Cluster object. This suppresses dial-outs from the other nodes while the node is being removed from the cluster.

4. From the OSM Service Connection, issue the Remove Node from Cluster action on the Local Node object. This removes the node from the Cluster Endpoint Database (CEPDB) configuration on each node in the cluster.

5. Stop the Expand-over-IB lines. Refer to Stopping Expand-Over-IB Lines on page 32.

6. On the node you are removing, stop the Cluster Connectivity subsystem. Refer to Stopping the Cluster Connectivity Subsystem on page 32.

7. Stop the $ZZCMN, $ZIMnn, and $ZDMnn processes on the node you are removing:

SCF ABORT PROCESS $ZZKRN.#ZZCMN
SCF ABORT PROCESS $ZZKRN.#MSGMON
SCF ABORT PROCESS $ZZKRN.#DCTMON

8. Delete the $ZZCMN, $ZIMnn, and $ZDMnn processes on the node you are removing:

SCF DELETE PROCESS $ZZKRN.#ZZCMN
SCF DELETE PROCESS $ZZKRN.#MSGMON
SCF DELETE PROCESS $ZZKRN.#DCTMON

9. Stop and delete the Expand-over-IB line-handler processes on the node you are removing.

10. From SCF, issue HELP WAN STOP DEVICE and HELP WAN DELETE DEVICE for more information on the commands used in the previous step.

11. Disconnect the cables one fabric at a time:

a. Disconnect the cable from the X fabric cluster switch.

b. For a NonStop X IB cluster only, wait 1 minute before disconnecting the other cable.

c. Disconnect the other cable from the Y fabric cluster switch.

Replacing IB Cluster Switches in a NonStop X Cluster Topology

Service providers should refer to the replacement instructions in the Replacing a NonStop Blade IB Switch, NonStop IO Expansion IB Switch, or NonStop IB Cluster Switch in a NonStop X System service procedure.
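The node-removal procedure above depends on ordering: every ABORT in step 7 precedes every DELETE in step 8, for each of the three generic processes. As a memory aid only, that SCF portion can be sketched programmatically; the snippet prints the commands rather than executing them, and simply restates the process names given in the steps.

```python
# Sketch of the SCF command ordering for node removal (steps 7-8
# above): ABORT all three generic processes, then DELETE them.
# This only prints the commands; it does not execute anything.
GENERIC_PROCESSES = ["$ZZKRN.#ZZCMN", "$ZZKRN.#MSGMON", "$ZZKRN.#DCTMON"]

def removal_commands(processes=GENERIC_PROCESSES):
    """Return the ABORT-then-DELETE command sequence as strings."""
    cmds = [f"SCF ABORT PROCESS {p}" for p in processes]
    cmds += [f"SCF DELETE PROCESS {p}" for p in processes]
    return cmds

for cmd in removal_commands():
    print(cmd)
```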
Starting and Stopping Cluster Connectivity Processes and Subsystems

When the Cluster Connectivity subsystem is stopped on a node, the node informs other nodes that it is leaving the NonStop X Cluster. Stopping the Cluster Connectivity subsystem destroys internode connectivity between the node that receives the command and all other nodes over both external fabrics. The NonStop X cluster monitor process $ZZCMN itself does not stop but remains an active process.

Stopping the Cluster Connectivity Subsystem

NOTE: Hewlett Packard Enterprise recommends that you do not stop and restart the Cluster Connectivity subsystem as a method to repair cluster connectivity.

Stopping the Cluster Connectivity subsystem is normally used prior to:

Physically disconnecting a node
Halting the processors on a node
The DSM/SCM ZPHIRNM step when installing a new release version update (RVU)
Aborting the CCMON process pair for the purpose of upgrading the CCMON software, unless recommended otherwise by Hewlett Packard Enterprise

Using SCF

1. At an SCF prompt, stop the Cluster Connectivity subsystem:

STOP SUBSYS $ZZCMN

2. Confirm that the Cluster Connectivity subsystem is stopped:

STATUS SUBSYS $ZZCMN

Using the OSM Service Connection

1. Right-click the OSM Cluster object, and select Actions.

2. From the Available Actions drop-down list, select Set Cluster Subsys State.

3. Click Perform Action.

4. From the drop-down list of NonStop X cluster states, select Stopped.

5. Click OK. The Action Status window shows the progress of the action.

6. Click Close to close the Actions dialog box.

7. From the details pane, select the Attributes tab.

8. From the Attributes tab, check that the state is stopped.

Stopping Expand-Over-IB Lines

Stopping Expand-over-IB lines is required when you add, move, or remove a node, or when you perform an RVU upgrade on a node. Use these SCF commands to identify and stop Expand-over-IB lines, and also refer to the Expand Lines Form on page 50.

1.
For the affected node, identify the currently configured Expand-over-IB lines. For example: