
Power Systems
High performance clustering using the 9125-F2C Planning and Installation Guide
Revision 1.3



Note: Before using this information and the product it supports, read the information in the Safety Notices section, in the IBM Systems Safety Notices manual, G , and in the IBM Environmental Notices and User Guide, Z .

This edition applies to IBM Power Systems 9125-F2C servers that contain the POWER7 processor.

© Copyright IBM Corporation 2011. US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Table of Contents

1 Safety notices
2 Manage High performance clustering using the 9125-F2C
Using the Cluster Guides
Cluster Guide Revision History
Clustering systems by using 9125-F2C
Cluster information resources
General cluster information resources
Cluster hardware information resources
Cluster management software information resources
Cluster software and firmware information resources
Hardware information
Physical packaging information
Hub
Quad chip module (QCM)
Octant
Drawer
Disk enclosure
Frame/Rack
Cluster configuration
Compute nodes
Disk enclosures and GPFS I/O nodes
Utility nodes
Cluster management hardware
Diskless Nodes
Cluster management
Cluster Service and Management Networks
Systems management
Network management
Hardware management
Event management
Software configuration
Subsystem configuration
Barrier-synchronization register
Power 775 Availability Plus Overview
Impacts of node resource failures
Spare Policy
Job management policies with A+
Actions in response to A+ Resource failures
A+ Responsibilities
Disk Failures
A+ Examples
A+ and Installation
Cluster planning
Planning Overview
Supported devices and software
Planning the compute subsystem
Operating system image
Service Node to Compute Node Relationship
LoadLeveler Performance Planning
Compute Hardware Placement Planning
Disk Storage Subsystem Planning
Management Subsystem Overview
Executive Management Server
Service Utility Nodes
HMCs
LANs
Service: Cluster Monitoring Planning
NTP Planning
Other Utility Nodes
Login nodes
Site specific utility nodes
Planning HFI network
Choosing a topology
Planning Cabling
Planning IP configuration for the HFI network
Planning Security
Other Planning Resources
Site Planning
Planning Power 775 Availability Plus Design
Planning Installation
Coordinating Installation
Phased Installation
Cluster Installation
How to use Cluster Installation
Overall Cluster Installation
Installation Terminology
Installation Responsibilities
Installation Documentation
Overview of Installation Flow
Detailed installation procedures
Pre-installation tasks
Software installation
9125-F2C system hardware installation
HFI network cabling installation
Other Installation Procedures
Configure ISNM
Using ISNM to verify the HFI network
Diskless Node Logging Configuration
Placement of LoadLeveler and TEAL GPFS Daemons
Barrier Sync Register (BSR) Configuration
Performance tuning
Installation checklists
Installation Responsibilities by Component
Installation Responsibilities by Software Component
Installation Responsibilities by Hardware Component
Notices

1 Safety notices

Safety notices may be printed throughout this guide:
- DANGER notices call attention to a situation that is potentially lethal or extremely hazardous to people.
- CAUTION notices call attention to a situation that is potentially hazardous to people because of some existing condition.
- Attention notices call attention to the possibility of damage to a program, device, system, or data.

World Trade safety information
Several countries require the safety information contained in product publications to be presented in their national languages. If this requirement applies to your country, a safety information booklet is included in the publications package shipped with the product. The booklet contains the safety information in your national language with references to the U.S. English source. Before using a U.S. English publication to install, operate, or service this product, you must first become familiar with the related safety information in the booklet. You should also refer to the booklet any time you do not clearly understand any safety information in the U.S. English publications.

German safety information
Das Produkt ist nicht für den Einsatz an Bildschirmarbeitsplätzen im Sinne § 2 der Bildschirmarbeitsverordnung geeignet.

Laser safety information
IBM servers can use I/O cards or features that are fiber-optic based and that utilize lasers or LEDs.

Laser compliance
IBM servers may be installed inside or outside of an IT equipment rack.

DANGER
When working on or around the system, observe the following precautions:
Electrical voltage and current from power, telephone, and communication cables are hazardous. To avoid a shock hazard:
- Connect power to this unit only with the IBM provided power cord. Do not use the IBM provided power cord for any other product.
- Do not open or service any power supply assembly.
- Do not connect or disconnect any cables or perform installation, maintenance, or reconfiguration of this product during an electrical storm.
- The product might be equipped with multiple power cords. To remove all hazardous voltages, disconnect all power cords.
- Connect all power cords to a properly wired and grounded electrical outlet. Ensure that the outlet supplies proper voltage and phase rotation according to the system rating plate.
- Connect any equipment that will be attached to this product to properly wired outlets.
- When possible, use one hand only to connect or disconnect signal cables.
- Never turn on any equipment when there is evidence of fire, water, or structural damage.
- Disconnect the attached power cords, telecommunications systems, networks, and modems before you open the device covers, unless instructed otherwise in the installation and configuration procedures.
- Connect and disconnect cables as described in the following procedures when installing, moving, or opening covers on this product or attached devices.

To Disconnect:
1. Turn off everything (unless instructed otherwise).
2. Remove the power cords from the outlets.
3. Remove the signal cables from the connectors.
4. Remove all cables from the devices.

To Connect:
1. Turn off everything (unless instructed otherwise).
2. Attach all cables to the devices.
3. Attach the signal cables to the connectors.
4. Attach the power cords to the outlets.
5. Turn on the devices.
(D005)

DANGER
Observe the following precautions when working on or around your IT rack system:
- Heavy equipment - personal injury or equipment damage might result if mishandled.
- Always lower the leveling pads on the rack cabinet.
- Always install stabilizer brackets on the rack cabinet.
- To avoid hazardous conditions due to uneven mechanical loading, always install the heaviest devices in the bottom of the rack cabinet. Always install servers and optional devices starting from the bottom of the rack cabinet.
- Rack-mounted devices are not to be used as shelves or work spaces. Do not place objects on top of rack-mounted devices.
- Each rack cabinet might have more than one power cord. Be sure to disconnect all power cords in the rack cabinet when directed to disconnect power during servicing.
- Connect all devices installed in a rack cabinet to power devices installed in the same rack cabinet. Do not plug a power cord from a device installed in one rack cabinet into a power device installed in a different rack cabinet.
- An electrical outlet that is not correctly wired could place hazardous voltage on the metal parts of the system or the devices that attach to the system. It is the responsibility of the customer to ensure that the outlet is correctly wired and grounded to prevent an electrical shock.

CAUTION
- Do not install a unit in a rack where the internal rack ambient temperatures will exceed the manufacturer's recommended ambient temperature for all your rack-mounted devices.
- Do not install a unit in a rack where the air flow is compromised. Ensure that air flow is not blocked or reduced on any side, front, or back of a unit used for air flow through the unit.
- Consideration should be given to the connection of the equipment to the supply circuit so that overloading of the circuits does not compromise the supply wiring or overcurrent protection. To provide the correct power connection to a rack, refer to the rating labels located on the equipment in the rack to determine the total power requirement of the supply circuit.
- (For sliding drawers.) Do not pull out or install any drawer or feature if the rack stabilizer brackets are not attached to the rack. Do not pull out more than one drawer at a time. The rack might become unstable if you pull out more than one drawer at a time.
- (For fixed drawers.) This drawer is a fixed drawer and must not be moved for servicing unless specified by the manufacturer. Attempting to move the drawer partially or completely out of the rack might cause the rack to become unstable or cause the drawer to fall out of the rack.
(R001)

Removing components from the upper positions in the rack cabinet improves rack stability during relocation. Follow these general guidelines whenever you relocate a populated rack cabinet within a room or building:
- Reduce the weight of the rack cabinet by removing equipment starting at the top of the rack cabinet. When possible, restore the rack cabinet to the configuration of the rack cabinet as you received it. If this configuration is not known, you must observe the following precautions:
  - Remove all devices in the 32U position and above.
  - Ensure that the heaviest devices are installed in the bottom of the rack cabinet.
  - Ensure that there are no empty U-levels between devices installed in the rack cabinet below the 32U level.
- If the rack cabinet you are relocating is part of a suite of rack cabinets, detach the rack cabinet from the suite.
- Inspect the route that you plan to take to eliminate potential hazards.
- Verify that the route that you choose can support the weight of the loaded rack cabinet. Refer to the documentation that comes with your rack cabinet for the weight of a loaded rack cabinet.
- Verify that all door openings are at least 760 x 2030 mm (30 x 80 in.).
- Ensure that all devices, shelves, drawers, doors, and cables are secure.
- Ensure that the four leveling pads are raised to their highest position.
- Ensure that there is no stabilizer bracket installed on the rack cabinet during movement.
- Do not use a ramp inclined at more than 10 degrees.
- When the rack cabinet is in the new location, complete the following steps:
  - Lower the four leveling pads.
  - Install stabilizer brackets on the rack cabinet.
  - If you removed any devices from the rack cabinet, repopulate the rack cabinet from the lowest position to the highest position.
- If a long-distance relocation is required, restore the rack cabinet to the configuration of the rack cabinet as you received it. Pack the rack cabinet in the original packaging material, or equivalent. Also lower the leveling pads to raise the casters off of the pallet and bolt the rack cabinet to the pallet.
(R002)

(L001) (L002)

(L003)

All lasers are certified in the U.S. to conform to the requirements of DHHS 21 CFR Subchapter J for class 1 laser products. Outside the U.S., they are certified to be in compliance with IEC as a class 1 laser product. Consult the label on each part for laser certification numbers and approval information.

CAUTION:
This product might contain one or more of the following devices: CD-ROM drive, DVD-ROM drive, DVD-RAM drive, or laser module, which are Class 1 laser products. Note the following information:
- Do not remove the covers. Removing the covers of the laser product could result in exposure to hazardous laser radiation. There are no serviceable parts inside the device.
- Use of the controls or adjustments or performance of procedures other than those specified herein might result in hazardous radiation exposure.
(C026)

Data processing environments can contain equipment transmitting on system links with laser modules that operate at greater than Class 1 power levels. For this reason, never look into the end of an optical fiber cable or open receptacle. (C027)

CAUTION: This product contains a Class 1M laser. Do not view directly with optical instruments. (C028)

CAUTION: Some laser products contain an embedded Class 3A or Class 3B laser diode. Note the following information: Laser radiation when open. Do not stare into the beam, do not view directly with optical instruments, and avoid direct exposure to the beam. (C030)

Power and cabling information for NEBS (Network Equipment-Building System) GR-1089-CORE
The following comments apply to the IBM servers that have been designated as conforming to NEBS (Network Equipment-Building System) GR-1089-CORE:
The equipment is suitable for installation in the following:
- Network telecommunications facilities
- Locations where the NEC (National Electrical Code) applies
The intrabuilding ports of this equipment are suitable for connection to intrabuilding or unexposed wiring or cabling only. The intrabuilding ports of this equipment must not be metallically connected to the interfaces that connect to the OSP (outside plant) or its wiring. These interfaces are designed for use as intrabuilding interfaces only (Type 2 or Type 4 ports as described in GR-1089-CORE) and require isolation from the exposed OSP cabling. The addition of primary protectors is not sufficient protection to connect these interfaces metallically to OSP wiring.
Note: All Ethernet cables must be shielded and grounded at both ends.
The ac-powered system does not require the use of an external surge protection device (SPD).
The dc-powered system employs an isolated DC return (DC-I) design. The DC battery return terminal shall not be connected to the chassis or frame ground.

2 Manage High performance clustering using the 9125-F2C

You can use this information to guide you through the process of managing 9125-F2C clusters. It is part of a series of guides to High performance clustering using the 9125-F2C:
- High performance clustering using the 9125-F2C Planning and Installation Guide
- High performance clustering using the 9125-F2C Management Guide
- High performance clustering using the 9125-F2C Service Guide

This document is intended to serve as a consolidation point for important information for managing, maintaining, and monitoring an IBM High Performance Computing (HPC) cluster using POWER technology and the Host Fabric Interface with the 9125-F2C server. It serves as a consolidation point for the documents of the many components and subsystems that comprise a cluster, and it aids the reader in navigating those other documents in an efficient manner. Where necessary, it recommends additional information to a person who has general skills within the discipline of the tasks being documented.

This document is not intended to replace existing guides for the various hardware units, firmware, operating system, software, or applications publications produced by IBM or other vendors. Therefore, most detailed procedures that already exist in other documents are not duplicated here. Instead, those other documents are referenced by this document, and as necessary guidance is given on how to work with generic information and procedures in the other documents.

The intended audiences for this document are:
- HPC clients, including:
  - System, network, and site planners
  - System administrators
  - Network administrators
  - System operators
  - Other Information Technology professionals
- IBM personnel, including:
  - Planners for the cluster and the site
  - System Service Representatives

2.1 Using the Cluster Guides

The document sections are roughly in the order in which you will need them. Reference the Table of Contents for an outline of the topics in this document.

The following table is a guide to finding topics in the High performance clustering using the 9125-F2C Cluster Guides:
- High performance clustering using the 9125-F2C Planning and Installation Guide
- High performance clustering using the 9125-F2C Management Guide
- High performance clustering using the 9125-F2C Service Guide

Once directed to a document for a certain topic, it is good practice to become familiar with the table of contents as a detailed outline of sub-topics within the document.

Content highlights
- Clustering systems by using 9125-F2C (all Cluster Guides): A chapter which provides overview information. It is customized within each guide. All of them have references to information resources, a brief overview of cluster components, and how to use the cluster guides. The Planning and Installation Guide also has a detailed overview of the cluster, its subsystems, its components, and unique characteristics.
- Detailed Overview (Planning and Installation Guide): In-depth overview of the cluster, its subsystems, its components, and unique characteristics.
- Planning information (Planning and Installation Guide): Planning information and references for all major subsystems and components in the cluster.
- Supported devices and software (Planning and Installation Guide): A list of supported devices and software at the time of publication. More up-to-date information is available on the HPC Central website (see References).
- Planning worksheets (Planning and Installation Guide): Worksheets to help plan the cluster.
- Cluster Installation (Planning and Installation Guide): This includes the following; references to other documentation are frequent.
  - Installation Responsibilities
  - Overview of the Installation Flow
  - Installation steps organized by teams and objectives
  - Detailed Installation procedures for topics not covered elsewhere
- Cluster management (Management Guide): This includes the following; references to other documentation are frequent.
  - Introduction to cluster management
  - A cluster management flow
  - HFI Network Management
  - Monitoring the cluster
  - Availability Plus monitoring
  - Data gathering for problem resolution
  - Cluster maintenance
  - Command references, especially for HFI network management

- Cluster service (Service Guide): This includes the following; references to other documentation are frequent.
  - Introduction to cluster service
  - Tables to narrow down to proper procedures
  - References to detailed procedures documented elsewhere
  - Hardware problem isolation topics
  - Software problem isolation topics
  - Power 775 Availability Plus actions
  - EMS failover references

Cluster Guide Revision History

The following outlines changes to the Cluster Guide. This includes changes across all of the individual guides that comprise the cluster guide.

Table 1: Cluster Guide Revision History

Revision 1 (all guides): Initial release.

Revision 1.1:
- Planning and Installation Guide:
  - In the Overview: Diskless nodes (new section); information on updating the Cluster Service and Management Networks; updated LoadLeveler configuration information; section on Highly Available Service Nodes and LoadLeveler and TEAL GPFS; Barrier-synchronization register overview; Availability Plus information on improving system availability.
  - In Planning: service and login nodes supported; LoadLeveler planning on service nodes; planning Highly Available Service Nodes and LoadLeveler and GPFS; TEAL monitoring and GPFS; typos and terminology cleanup.
  - Installation: typos and grammar; Installation Documentation links updated for TEAL and LoadLeveler; bringup of LoadLeveler in the installation flow; diskless node logging configuration in the installation flow; placement of LoadLeveler and TEAL GPFS daemons in the installation flow; Barrier Sync Register (BSR) configuration in the installation flow.
- Management Guide: Changes to command references: chnwfence, nwlinkdiag.

Revision 1.2:
- Service Guide: Add Start Here section; update User reported problems; update System Log reported problems; update Problem reported in TEAL; update Problem reported in Service Focal Point; update Problem reported in component or subsystem log; add Node will not boot; add Checking routing mode; add xcat problems; add TEAL Service Procedures; add Acting on HFI Network Alert; add Acting on HFI Network Serviceable Event; isolation procedure updates and additions (HFI_DDG, HFI_LDG, HFINSFP, HFINNBR, HFILONG); add Data collection section; extensive HFI Network Locations updates; add HFI Network Event Reference; terminology updates.
- All guides: Add section for Cluster Guide Revision History.
- Planning and Installation Guide: Power 775 Availability Plus Overview updates: some numbered list format issues for the Availability Plus overview; statement regarding when A+ management begins.
- Management Guide: Command reference updates: Network Management commands (cnm.snap output info); OS system command reference for network management; chghfi changed to chdev; add ifhf_dump for AIX; add hfi_snap.

Revision 1.3:
- Service Guide: Data collection updates for: ISNM network; HFI network or HFI driver specific to a node; TEAL.
- Management Guide: Fix some typos; add isrmon and isrperf commands to Command references.

2.2 Clustering systems by using 9125-F2C

Clustering systems by using 9125-F2C provides references to information resources and a brief overview of cluster components. The cluster consists of many components and subsystems, each with an important task aimed at accomplishing user work and maintaining the ability to do so in the most efficient manner possible. The following paragraphs introduce and briefly describe the various subsystems and their main components.

The compute subsystem consists of:
- The 9125-F2C systems configured as nodes dedicated to performing computational tasks. These are diskless nodes.
- Operating system images customized for compute nodes
- Applications

The storage subsystem consists of:
- 9125-F2C systems configured as I/O nodes dedicated to serving the data for the other nodes in the cluster. These are diskless nodes.
- Operating system images customized for I/O nodes
- SAS adapters in the 9125-F2C systems, which are attached to disk enclosures
- Disk enclosures
- General Parallel File System (GPFS)

The communications subsystem consists of:
- The Host Fabric Interface technology in the 9125-F2C
- Busses from the processor modules to the switching hub in an octant. For more information, see Octant on page 16.
- Local links (LL-links) between octants in a 9125-F2C. For more information, see Octant on page 16.
- Local remote links (LR-links) between drawers in a SuperNode
- Distance links (D-links) between SuperNodes
- The operating system drivers
- The IBM User space protocols
- AIX and Linux IP drivers

The management subsystem consists of:
- Executive Management Server (EMS) running key management software. Different types of servers might be used. For more details, see <<to be added>>.
- Operating system on the EMS
- Utility nodes used as xcat service nodes. These serve operating systems to local diskless nodes and provide hierarchical access to hardware and nodes from the EMS console.
- Extreme Cloud Administration Toolkit (xcat) running on the EMS and service utility nodes. For information on xcat, go to the xcat website.
- DB2, running on the EMS and service utility nodes

- Integrated Switch Network Manager (ISNM) running on the EMS
- Toolkit for Event Analysis and Logging (TEAL)

Other utility nodes are customizable for each site. These utility nodes must be 9125-F2C servers:
- Login nodes are required.
- Other site-unique nodes, such as tape subsystem servers, are optional, but must be 9125-F2C systems.

A summary of node types:
- Compute nodes: provide computational capability. Compute nodes generally comprise most of the cluster.
- I/O nodes: provide connectivity to the storage subsystem. The number of I/O nodes is driven by the amount of required storage.
- Utility nodes: provide unique functions.
  - Service nodes running xcat, which serve operating systems to local diskless nodes and provide hierarchical access to hardware and nodes from the EMS console. These service nodes are required.
  - Login nodes, which provide a log-in gateway into the cluster. These login nodes are required.
  - Other site-unique nodes, such as tape subsystem servers. These unique nodes are optional, but must be 9125-F2C systems.

Note: The EMS and HMC are considered to be management consoles and not nodes.

Key concepts that are introduced with this cluster are:
- Most of the nodes are diskless and get their operating systems and scratch space served by the service utility nodes. Diskless boot is performed over the HFI.
- The Power 775 Availability Plus configuration for processors, switching hubs, and HFI cables provides extra resources in the cluster, which allows these components to fail and not be replaced until the cluster is nearing the point of not being able to achieve the agreed-upon workload capabilities.

Cluster information resources

Cluster information resources provides references to information resources for the cluster, its subsystems, and its components. The following tables indicate important documentation for the cluster, where to get it, and when to use it relative to the Planning, Installation, and Management and Service phases of a cluster's life. The tables are arranged into categories of components:
- General cluster information resources
- Cluster hardware information resources
- Cluster management software information resources
- Cluster software and firmware information resources

General cluster information resources

The following table lists general cluster information resources and the phases (Plan, Install, Manage and service) in which they apply.

Table 2. General cluster resources
- IBM Cluster Information: IBM Clusters with HFI and 9125-F2C website
- Power Systems High performance clustering using the 9125-F2C Planning and Installation Guide
- Power Systems High performance clustering using the 9125-F2C Management Guide
- Power Systems High performance clustering using the 9125-F2C Service Guide
- IBM HPC Clustering with Power 775 servers - Service Packs portion of the IBM High Performance Computing Clusters Service Packs website
- HPC Central wiki and HPC Central forum: The HPC Central wiki enables collaboration between customers and IBM teams. This wiki includes questions and comments.
- IBM Fix Central. Note: IBM Fix Central should only be used in conjunction with the readme website for IBM Clusters with HFI and 9125-F2C, above, because Fix Central may contain code levels that have been verified for other environments, but not for this cluster.

Cluster hardware information resources

The following table lists cluster hardware resources.

Table 3. Cluster hardware information resources
- Site planning for all IBM systems: System i and System p Site Preparation and Physical Planning Guides; Site and Hardware Planning Guide
- POWER7 9125-F2C system: Installation Guide for 9125-F2C; Servicing the IBM system p 9125-F2C; PCI Adapter Placement; Worldwide Customized Installation Instructions (WCII) - IBM service representative installation instructions for IBM machines and features
- Logical partitioning for all systems: Logical Partitioning Guide; Install Instructions for IBM LPAR on System i and System p

IBM Power Systems documentation is available in the IBM Power Systems Hardware Information Center. Any exceptions to the location of information resources for cluster hardware as stated above have been noted in the table. Any future changes to the location of information that occur before a new release of this document will be noted on the HPC Central website.

Cluster management software information resources

The following table lists cluster management software information resources.

Table 4. Cluster management software resources
- Hardware Management Console (HMC): Installation and Operations Guide for the HMC; Operations Guide for the HMC and Managed Systems
- xcat: For xcat documentation, go to xcat documentation (/index.php?title=xcat_documentation)
- Integrated Switch Network Manager (ISNM): This document

- Toolkit for Event Analysis and Logging (TEAL): On sourceforge.net (index.php?title=main_page); when installed, on the EMS: /opt/teal/doc/teal_guide.pdf

IBM Power Systems documentation is available in the IBM Power Systems Hardware Information Center.

Cluster software and firmware information resources

The following table lists cluster software and firmware information resources.

Table 5. Cluster software and firmware information resources
- AIX: AIX Information Center
- Linux: Obtain information from your Linux distribution source
- DB2: For information, go to DB2
- IBM HPC Clusters Software:
  - GPFS: Concepts, Planning, and Installation Guide
  - GPFS: Administration and Programming Reference
  - GPFS: Problem Determination Guide
  - GPFS: Data Management API Guide
  - Tivoli Workload Scheduler LoadLeveler for AIX: Installation Guide (SC )
  - Tivoli Workload Scheduler LoadLeveler for Linux: Installation Guide (SC )
  - Tivoli Workload Scheduler LoadLeveler: Using and administering (SC )
  - Tivoli Workload Scheduler LoadLeveler: Command and API Reference (SC )
  - Tivoli Workload Scheduler LoadLeveler: Diagnosis and Messages Guide (SC )
  - Tivoli Workload Scheduler LoadLeveler: Resource Manager Guide (SC )
  - Parallel Environment: Installation
  - Parallel Environment: Messages
  - Parallel Environment: Operation and Use, Volumes 1 and 2
  - Parallel Environment: MPI Programming Guide
  - Parallel Environment: MPI Subroutine Reference

The IBM HPC Clusters Software Information can be found at the IBM Cluster Information Center.


2.2.2 Hardware information

Hardware information describes the hardware for clustering with the 9125-F2C. The core of the 9125-F2C cluster hardware solution is the 9125-F2C drawer, with its innovative integrated network, and the cluster storage drawer. A number of unique components comprise the drawer, and each component contributes to the extreme scaling, performance, and reliability characteristics of the cluster solution. The hardware sets a new bar in density, scaling, performance, reliability, energy efficiency, and infrastructure elimination in contrast with comparable solutions.

The hardware is still a general-purpose supercomputing platform; applications do not have to be changed to use the hardware. This permits earlier applications to benefit from this high performance system with minimal changes. Moreover, it permits application developers to start using new programming models, like asynchronous PGAS, to write new applications that are ready to run on these systems.

Figure 1. POWER7 Chip Block Diagram

POWER7 chip description:
- 8 processor cores per POWER7 chip, with each core supporting the 64-bit PowerPC architecture
- A scalar out-of-order micro-architecture and four-way multithreading
- Core frequency: 3.5 GHz to 3.85 GHz
- 12 execution units per core:
  - 2 Fixed Point Units
  - 2 Load Store Units
  - 4 Double Precision Floating Point Units (FPUs), 2 FLOPS/cycle/FPU (224 GFLOPS per chip at 3.5 GHz, 246 GFLOPS at 3.85 GHz)
  - 1 Branch
  - 1 Conditional Register
  - 1 Vector unit
  - 1 Decimal Floating Point Unit
- 6-wide dispatch
- Units include distributed Recovery Function

- Out-of-order execution
- 4-way SMT per core (32 threads per POWER7 chip)
- 32 KB L1 Instruction Cache per core
- 32 KB L1 Data Cache per core
- 256 KB L2 Cache per core
- 4 MB eDRAM L3 per core
- Dual DDR3 Memory Controllers
- 100 GB/s memory bandwidth per chip
- 3 8B Intranode Buses (X, Y, Z)
  - 1 bus at 8B + 8B with 2 extra bits per bus
  - Bus speed is 3.0 Gb/s
  - Used for both address and data
- 3 8B Intranode Buses (A, B, C)
  - 1 bus at 8B + 8B with 2 extra bits per bus
  - Bus speed is 3.0 Gb/s
  - Used for data only
- 1 8B Intranode Bus to connect to the Hub Chip (W)
  - 1 bus at 8B + 8B with 2 extra bits per bus
  - Bus speed is 3.0 Gb/s
  - Used for both address and data

Physical packaging information

The following table lists the physical packaging definitions used in 9125-F2C clusters.

Table 6. Physical packaging definitions used in 9125-F2C
- Building Block: Logical group of 3 racks consisting of 32 CPC drawers (8 SuperNodes) and disk enclosures
- Rack: Rack or cabinet assembly with power and cooling infrastructure
- SuperNode: Physical grouping of 4 2U CPC (central processor complex) drawers connected all-to-all with LR-links
- Disk Enclosure: Integrated storage drawer containing SAS disk drives
- CPC Drawer: Central processor complex; a 2U drawer of 8 QCMs, 128 memory slots, 8 hubs, and 17 PCIe slots
- Octant: A node's affinity domain of QCM, memory, Torrent Hub Chip, and 2 PCI slots
- QCM: Quad-Chip Module; a multi-level ceramic module composed of 4 POWER7 8-core chips
- Hub: Torrent chip and associated L-link and D-link interconnections
- PCIe Slot: One of the 17 available adapter slots in a CPC drawer

A node in this cluster is an operating system image running on a single QCM with memory and a Hub. The hardware of a node contains the following elements:
- 4 processor chips and a Hub
- 32 processor cores, 16 memory DIMMs (dual inline memory modules), and a Hub
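The peak floating-point figures quoted in this chapter follow directly from the per-core numbers above and the packaging definitions in Table 6. A minimal arithmetic sketch (Python, using only values stated above; results are rounded):

    # Peak double-precision FLOPS implied by the POWER7 figures above
    cores_per_chip = 8
    fpus_per_core = 4              # double-precision FPUs per core
    flops_per_fpu_per_cycle = 2

    for ghz in (3.5, 3.85):
        chip_gflops = cores_per_chip * fpus_per_core * flops_per_fpu_per_cycle * ghz
        print(ghz, chip_gflops)    # 224.0 and 246.4 GFLOPS per chip

    # Scaling up through the packaging in Table 6:
    # node (octant) = 1 QCM = 4 chips, CPC drawer = 8 QCMs, SuperNode = 4 drawers
    node_tflops = 4 * 246.4 / 1000          # ~0.98 TFLOPS per node at 3.85 GHz
    drawer_tflops = 8 * node_tflops         # ~7.9 TFLOPS per CPC drawer
    supernode_tflops = 4 * drawer_tflops    # ~31.5 TFLOPS per SuperNode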

Figure 2. Compute Nodes

A CPC drawer contains eight logical nodes and consists of the following elements:
- 8 QCMs and 128 memory DIMMs with 8 Hubs
- A total of 256 processor cores, 8 Hubs, 128 DIMM slots, and 17 PCIe slots

A SuperNode contains four CPC drawers and consists of the following elements:
- 32 QCMs and 512 memory DIMMs with 32 Hubs
- A total of 1024 processor cores, 32 Hubs, and 64 16x and 4 8x PCIe slots

The cluster's high-speed interconnects are connected in a two-tier network hierarchy. Each hub on a node in this cluster has an Integrated Switch Router (ISR) that is physically connected to other ISRs within the cluster. The links between the hubs within a SuperNode are called L-Local Buses (copper) and L-Remote Links (optical). The L-links (L-Local and L-Remote) within a SuperNode form the first-level connections of a two-level hierarchical, fully connected topology. The second level of the hierarchical interconnect is supported by a set of optical links between the hubs on different SuperNodes. Each hub in the system contains 10 of these optical links, called D-links. A SuperNode with four CPCs, each containing 8 hubs with 10 D-links per hub, has 320 D-links at its rear. These D-links are used to connect to other SuperNodes.

A rack is the cabinet infrastructure that accommodates the CPC drawers and disk enclosures that comprise a cluster configuration. It provides approximately 28 EIA units of space to accommodate CPCs and disk enclosures in the center section of the rack, plus the following:
- Bare cabinet with casters
- Cover set: two side covers, a front cover, and a rear cover with an integrated water-to-air heat exchanger
- Dual Bulk Power and Control Assemblies (BPCA) located in the top of the rack
- Water Conditioning Units (WCU) located at the bottom of each rack

The rack can be populated in different configurations depending on the quantity of compute capability, I/O capability, and storage capability that is required.

Building block: The 9125-F2C building block is a logical entity, consisting of three racks, that permits a simplified view of the cluster. It is used during physical planning, installation, and by systems management. For example, some projects have a building block consisting of eight SuperNodes and either three or four storage enclosures.

Three rack configuration building block

An example of a three rack configuration building block is illustrated in the following figure.

Figure 3. Three rack building blocks

The cluster components for this system are distributed as follows:
- Rack 1: Three SuperNodes
- Rack 2: Two SuperNodes and three Disk Enclosures
- Rack 3: Three SuperNodes

Hub

This section provides information about the 9125-F2C Hub. The hub implements the following functions:
- A cache coherent fabric controller with point-to-point connections to the four POWER7 chips
- Two 16x PCI-Express controllers with one 16x slot wired to each 16x controller (16x PCI-Express Gen2 channels). The first hub in the server contains three controllers to provide for a third PCI-Express slot.
- One 8x PCI controller; in the central processor complex (CPC), one of the eight octants has this controller wired out to an 8x slot
- An interconnect interface controller for supporting network communication over MPI, IP, and Global Shared Memory (GSM)
- Robust hardware support for collective accelerations with integrated floating point units

- A seven-port L-Local Bus of 8B/8B at 3.0 GHz permitting point-to-point connections to seven other hubs within the 2U CPC. This bus permits any OS image within the CPC to access any PCI-Express I/O slot within the CPC.
- A 24-ported L-Remote Link of 6X/6X, 10 Gb/s fiber, permitting connections to the 24 hubs of the other three 2U CPCs of a SuperNode
- Up to 16 ported D-Links of 10X/10X, 10 Gb/s fiber, permitting connections to other SuperNodes

Figure 4. 9125-F2C Hub

Hub chip brief description

The Hub chip consists of the following elements:
- 2 Host Fabric Interface (HFI) units
- 1 Collective Acceleration Unit (CAU)

- 1 Nest Memory Management Unit (NMMU)
- 1 Integrated Switch Router (ISR)
- Global Counter

Host fabric interface: The Host Fabric Interface (HFI) is analogous to an adapter and provides the link between the host system and the network. By virtue of being on the same coherent bus as the POWER7 chips, the HFI can source data from a processor's L2 cache and can also add incoming data into the cache. The HFI interface is like the one in an adapter, improved by a more scalable RDMA capability and new functions like immediate send and remote atomics. The HFI supports the following packet formats:
- FIFO mode (supports a 2 KB maximum packet size)
- Immediate send (128B packet using the new icswx instruction)
- IP FIFO mode
- IP with descriptors (supports a 2 KB maximum packet size)
- RDMA
  - Full RDMA: write, read, fence with completion (supports 32 MB messages using up to 2 KB packets)
  - Half RDMA: write, read, with completion (supports a 2 KB maximum message size)
  - Small RDMA: read-modify-write with completion; completion with fetch (2-, 4-, and 8-byte remote atomic updates including ADD, AND, OR, XOR, and Compare and Swap with and without Fetch)
  - GUPS RDMA: up to four GUPS requests can be included in a single 128B packet
- Collectives
  - Multicast
  - Reduce: supports NOP, SUM, MIN, MAX, OR, AND, XOR; single precision (32 bit) and double precision (64 bit); signed and unsigned; fixed-point and floating-point

Collective acceleration unit: The Collective Acceleration Unit (CAU) within the Torrent chip helps speed up the processing of a limited set of collective operations between a set of nodes. When a parallel job is run across a set of nodes, the software can set up a tree of a subset or all of those nodes, by including the processors where the job is running and a set of CAUs. The system can support multiple trees simultaneously, and each CAU can be part of up to 64 independent trees. The software programs a tree by writing the locations (ISR IDs) of each CAU's neighbors in the tree through an MMIO write. Processors interact with the CAUs by sending and receiving packets to and from the CAUs through the send and receive FIFOs of an HFI. The CAU supports two basic types of collective operations, multicast and reduce:
- Multicast: a packet that is broadcast to all the nodes in a tree; it can begin at any node.
- Reduce: packets are collected from all nodes other than the root node and are combined into a single packet. The single packet is based on the specified reduction operation and is delivered to the root node. The root node itself is not part of the reduction operation in hardware; the local reduction, if required, must then be done in software. Any node in the tree can be arbitrarily designated as the root node.

CAU features:
- There is one CAU per Torrent chip, which is attached to the four POWER7 chips on the QCM in the same octant as the Torrent chip
- Supported operations: Multicast; Reduce (NOP, SUM, MIN, MAX, OR, AND, XOR)
- Supported operand sizes and types: single precision (32 bit) and double precision (64 bit), signed and unsigned, fixed-point and floating-point
- Tree count: each CAU supports 64 trees from the possible 16 million trees in the system (64 bit tree CAM with 34 bit tree IDs)
- Data field size: 64 bytes
- For each tree, the CAU supports up to 9 node neighbor IDs and 1 HFI window ID
- In-flight transactions: 2 (current and previous)

Nest memory management unit: The Torrent Nest Memory Management Unit (NMMU) provides address translation capabilities for the HFI units. For 16 GB pages, the device driver maps them as 4 GB pages in the NMMU. There is a 6 K-entry Address Translation Lookaside Buffer (ATLB) cache in the NMMU per HFI. Assuming a 4 GB per core standard configuration, up to 75% of the physical memory in a QCM can be mapped and held in the NMMU cache if 16 MB pages are used. An application can map all 128 GB of physical memory in a QCM using 64 K pages, as the PHYP sets aside enough host memory to permit this mapping. An application using 16 GB host pages and mapping them as 4 GB pages in the NMMU can map all 128 GB of physical memory in a QCM using only 32 ATLB entries, thereby guaranteeing that it does not incur the penalty of a cache miss.

Integrated switch router: The Integrated Switch Router (ISR) unit in each Torrent replaces the external switches of a traditional fat tree network. It also provides significant cost and performance improvements over commodity offerings. The ISR supports a two-tier fully connected graph topology. In the first tier, 32 QCMs are connected to each other through L-links to form a SuperNode. In the second tier, up to 512 SuperNodes are connected to each other through D-links. As a further optimization, the 8 QCMs that are part of a single drawer are connected to each other through copper links (L-local). These copper links have higher bandwidth to accommodate extra PHYP coherency traffic across the drawer and I/O traffic between the QCMs. The copper links also provide a cost advantage by significantly reducing the optics in the system.

One of the major advantages of the ISR network topology is the high-radix switch, which permits packets to traverse the network with a minimum number of hops when using direct routes. In addition to a direct route mode, the ISR network also supports an indirect route mode. This indirect route mode permits applications to benefit from the huge amount of bandwidth provided by the ISR network topology. Although the current generation of the hardware does not support adaptive routing, the hardware provides for a software-directed indirect route mode. The protocol layers and runtimes can use this route mode for managing congestion and minimizing out-of-order packet delivery while still using the higher performance benefits of indirect routes. For systems smaller than full size, the ISR network supports multiple D-links between SuperNodes to further boost inter-SuperNode bandwidth.

ISR features:
- High-radix 56 x 56 full crossbar
- Multiple virtual channels for deadlock avoidance
- Two-tier fully connected graph topology
- Multiple route modes: hardware direct; hardware multiple direct (round robin); hardware indirect (random or round robin); software directed indirect
- Packet sizes: minimum 128 byte flit, maximum 2 KB
- Support for IP multicast
- Per-port performance counters
- Packet trace facility for functional and performance debug
- Per-flit CRC with link-level retry

Global counter: The Hub supports a synchronized global counter function across the network. This function is useful for application synchronization, performance analysis, and debug, and it is a key component of the cluster OS jitter mitigation strategy. The Global counter is a register in each ISR, which is shadowed in the Nest Memory Management Unit (NMMU) to provide host MMIO read access. The Global counter is synchronized across the ISR network with a maximum drift and skew that is less than the minimum 0-byte packet latency.

Quad chip module (QCM)

In the cluster hardware offering, four POWER7 chips are packaged on a single glass ceramic module, which is called a quad chip module. The QCM includes busses between the POWER7 chips. It also includes busses to off-module memory and an associated hub chip block providing connectivity within an octant. The following figure shows the QCM logic details with the inter-chip bandwidths and each chip's memory bandwidth.

Figure 5. Quad chip module logic block

Octant

The four POWER7 chips in the QCM are tightly coupled with the A, B, C and X, Y, Z busses and form a flat SMP. Each QCM is connected to a hub with 4 W busses (one from each POWER7) and, together with its associated memory and PCIe slots, forms an octant. The octant is one-eighth of a drawer and is commonly referred to by the software team as a node. The octant is the maximum supported LPAR size, so there are at least eight nodes in a single drawer. An octant can be subdivided into up to four LPARs, for a maximum of 32 nodes in a single drawer. There are 16 DIMM slots and two PCIe x16 Gen2 slots in an octant. Octant 0 has an additional x8 Gen2 slot. The following figure shows the octant logic block with its associated power supply, memory DIMMs, QCM, Hub with optics, and the PCIe busses.

Figure 6. Octant logic block

Drawer

This topic provides layout information for a drawer. The following figure shows the drawer layout, which has eight octants and the associated power components. For the first time, there is a redundant DCCA, which provides an unprecedented level of redundancy.

Figure 7. Drawer layout

Disk enclosure

This topic provides information about the disk (storage) enclosure. The disk enclosure is also sometimes referred to as the storage enclosure.

Figure 8. Disk enclosure (callouts: internal heat exchanger, 12 internal fans, drive carrier, 16x port card, redundant DCCA, cable management tray)

The disk enclosure contains the following:
- 376 SFF disk drives and 8 Solid State Drives (SSDs) (when fully configured); the SFF disk drives are 10K rpm and 600 GB
- 96 drive carriers holding 4 drives each
- 16 port cards in pairs, with each pair powering and controlling 48 drives redundantly; port cards have either 2 SAS 4x ports or 4 SAS 4x ports (for chaining)
- 2 Distributed Conversion and Control Assemblies (DCCAs)
- 12 fans
- 1 water-to-air heat exchanger, located internal to the chassis

The disk enclosure supports the following data rates (Serial Attached SCSI, SAS, 6 Gb):
- 6.0 Gbps per lane (SAS adapter in CEC to DE port card)
- 6.0 Gbps per lane (DE port card to drives)
- 3.0 Gbps per lane (DE port card to SSDs)
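The drive counts and the raw capacity figures quoted for the enclosure are internally consistent; a minimal arithmetic sketch (Python, decimal terabytes, ignoring formatting and RAID overhead):

    # Disk enclosure drive-count and raw-capacity check
    carriers = 96
    drives_per_carrier = 4
    total_drive_slots = carriers * drives_per_carrier   # 384 = 376 SFF HDDs + 8 SSDs

    port_card_pairs = 16 // 2
    drives_controlled = port_card_pairs * 48             # 384, matching the slot count

    raw_capacity_tb = 376 * 0.6                           # 376 HDDs x 600 GB ~= 226 TB raw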

Frame/Rack

This topic provides the layout information for a drawer rack or frame. A fully loaded rack can support up to three SuperNodes and a storage enclosure. This arrangement represents about 92 TFLOPS of compute capability and about 220 TB of raw storage capacity.

Figure 9. Fully loaded rack
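Those rack-level figures follow from the per-SuperNode and per-enclosure arithmetic sketched earlier; allowing for rounding, a quick check (Python):

    # Fully loaded rack: 3 SuperNodes plus one disk enclosure
    supernode_tflops = 4 * 8 * 4 * 0.2464   # 4 drawers x 8 QCMs x 4 chips x 246.4 GFLOPS ~= 31.5
    rack_tflops = 3 * supernode_tflops      # ~94.6 TFLOPS peak; quoted above as "about 92"
    rack_raw_tb = 376 * 0.6                 # one enclosure, ~226 TB raw; quoted as "about 220"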

2.2.3 Cluster configuration

Cluster configuration describes the configuration of subsystems and components for clustering with a 9125-F2C system. The following topics provide information about cluster configuration.

Compute nodes

A compute node is a partition within a 9125-F2C server. A compute node is a single operating system image within an octant, which is composed of a QCM, 128 GB of main memory, and a hub. Each compute node produces 0.98 TFLOPS peak, is configured as a 32-way shared-memory multiprocessor, and is controlled by an OS image. There are 2560 nodes in total in this configuration, and most of them are used for compute workloads. An itemized technical description of the subcomponents that make up the compute node is listed in the following information.

A hub contains the following:
- A cache coherent fabric controller with point-to-point connections to the four POWER7 chips
- Two 16x PCI-Express controllers with one 16x slot wired to each 16x controller (16x PCI-Express Gen2 channels)
- One 8x PCI controller; in the central processor complex (CPC), one of the eight octants has this controller wired out to an 8x slot
- An interconnect interface controller for supporting network communication over MPI, IP, and Global Shared Memory (GSM)
- Robust hardware support for collective accelerations with integrated floating point units
- A seven-port L-Local Bus of 8B/8B at 3.0 GHz permitting point-to-point connections to seven other hubs within the 2U CPC. This bus permits any OS image within the CPC to access any PCI-Express I/O slot within the CPC.
- A 24-ported L-Remote Link of 6X/6X, 10 Gb/s fiber, permitting connections to the 24 hubs of the other three 2U CPCs of a SuperNode
- Up to 16 ported D-Links of 10X/10X, 10 Gb/s fiber, permitting connections to other SuperNodes

Disk enclosures and GPFS I/O nodes

This topic provides information about the 9125-F2C disk enclosures and GPFS I/O nodes. The following components determine the I/O node configuration and provide the disk storage capacity to run the nodes associated with dedicated GPFS functions. The P3 cluster configuration contains a total of 40 GPFS nodes and 40 disk enclosures. For each building block, the rack locations are as follows: disk enclosures are located in Rack 1 (F1) location U9, Rack 2 U9 and U13, and Rack 3 (F3) U9 (see the following figure for a visual representation of these locations).

I/O node definition:
- Octant 0, along with the PCI slots containing SAS adapters, defines an LPAR that runs the GPFS VDisk code. Octant 1 can be defined as a backup for GPFS. Octants 1-7 can be individually defined as independent LPARs (logical partitions) for compute nodes.
- All 8 octants have QCMs and Hubs populated.
- Each Hub contains 10 D-Links and 24 L-Links.
- The CPC drawer contains 2 DCCAs.
- All 128 memory slots contain 8 GB memory DIMMs.

The drawer PCI slots are allocated as follows:
- PCI Slot 1 - PCIe x8 Gen 2 Two Port 6 Gb/s SAS adapter
- PCI Slot 2 - PCIe x8 Gen 2 Two Port 6 Gb/s SAS adapter
- PCI Slot 3 - PCIe x8 Gen 2 Two Port 6 Gb/s SAS adapter
- PCI Slot 4 - PCIe x8 Gen 2 Two Port 6 Gb/s SAS adapter
- PCI Slot 5 - PCIe x8 Gen 2 Two Port 6 Gb/s SAS adapter
- PCI Slot 6 - PCIe x8 Gen 2 Two Port 6 Gb/s SAS adapter
- PCI Slot 7 - PCIe x8 Gen 2 Two Port 6 Gb/s SAS adapter
- PCI Slot 8 - PCIe x8 Gen 2 Two Port 6 Gb/s SAS adapter
- PCI Slot 9 - PCIe x8 Gen 2 Two Port 6 Gb/s SAS adapter
- PCI Slot 10 - PCIe x8 Gen 2 Two Port 6 Gb/s SAS adapter
- PCI Slot 11 - PCIe x8 Gen 2 Two Port 6 Gb/s SAS adapter
- PCI Slot 12 - PCIe x8 Gen 2 Two Port 6 Gb/s SAS adapter
- PCI Slot 13 - PCIe x8 Gen 2 Two Port 6 Gb/s SAS adapter
- PCI Slot 14 - PCIe x8 Gen 2 Two Port 6 Gb/s SAS adapter
- PCI Slot 15 - PCIe x8 Gen 2 Two Port 6 Gb/s SAS adapter
- PCI Slot 16 - PCIe x8 Gen 2 Two Port 6 Gb/s SAS adapter
- PCI Slot 17 - Air Flow Baffle

Figure 10. I/O Node

Utility nodes

This topic describes the Utility nodes. While the bulk of the system contains CPC drawers full of compute nodes, there is a pair of specialized Utility nodes in each of the building blocks. The following are the LPAR definitions for the two Utility/Service nodes located in rack 1 U13 and rack 3 U13 of every building block.

Utility node definition:
- Processor 0 of Octant 0, along with PCI slots 17, 16, 15, and 14, defines an LPAR that is dedicated to running systems management functions (that is, boot code and image management).
- Processors 1-3 of Octant 0, along with PCI slots 1 through 13, define an LPAR that can run other service functions (such as login and scheduling daemons).
- Octant 1 is defined as a backup. Octants 1-7 are designated as compute nodes.
- All 8 octants have QCMs and Torrent Hubs populated.
- Each Torrent Hub contains 10 D-Links and 24 L-Links.
- The CEC drawer contains 2 DCCAs.
- All 128 memory slots contain 8 GB memory DIMMs.
- Service nodes have disks.

Utility CPC PCI slots are allocated as follows:
- PCI Slot 1 - 10 Gb E'NET-LR, 1-Port Adapter
- PCI Slot 2 - Air Flow Baffle
- PCI Slot 3 - 10 Gb E'NET-LR, 1-Port Adapter
- PCI Slot 4 - Air Flow Baffle
- PCI Slot 5 - Air Flow Baffle
- PCI Slot 6 - Air Flow Baffle
- PCI Slot 7 - Air Flow Baffle
- PCI Slot 8 - Air Flow Baffle
- PCI Slot 9 - Air Flow Baffle
- PCI Slot 10 - Air Flow Baffle
- PCI Slot 12 - Air Flow Baffle
- PCI Slot 13 - Air Flow Baffle
- PCI Slot 14 - PCIe 600 GB HDD
- PCI Slot 15 - PCIe x8 Gen 2 Two Port 6 Gb/s SAS adapter
- PCI Slot 16 - PCIe 600 GB HDD
- PCI Slot 17 - 1 Gb Ethernet UTP 4 port Adapter, PCIE-4x/Short, LP CAP

Figure 11. Utility Node

Cluster management hardware

The cluster management hardware that supports the cluster is placed in 42U 19-inch racks. Cluster management requires Hardware Management Consoles (HMCs), dual Executive Management Servers (EMS), and the associated Ethernet network switches. The HMC is a 1U IBM server. The EMS is a 4U POWER7 entry-level server.

Diskless Nodes

Several different node types are typically implemented without local disk access. Remote disk access will be provided. Depending on the type of data, GPFS or an NFS-mounted filesystem will be used. Typically, user data will be kept in GPFS.

With diskless nodes, logging must be considered carefully. In order to be able to diagnose problems when nodes crash, some of the critical HPC product log files should be put in a global file system so they are available even after the node is down. Another consideration is that some of the log files can drive a large amount of file system traffic, especially when coming from all of the compute nodes around the same time. Therefore, you need to make sure the global file system chosen can handle the traffic as well as the volume. A few suggestions for where to put them are provided below, in order of preference:
1. A GPFS filesystem, except for the GPFS log files themselves
2. An external NFS server sized appropriately to handle the traffic. Consider the network bandwidth, the CPUs, the memory, and the number of hard disks.
3. The service node. If you are going to log all HPC log files to the service node, you will need more internal drives to handle the bandwidth.

Details for configuring log files on diskless nodes are included in the Installation chapter under Diskless Node Logging Configuration.

2.2.4 Cluster management

Cluster management describes managing clusters with 9125-F2C systems.

Cluster Service and Management Networks

The following figure shows the logical structure of the two Ethernet networks for the cluster, known as the Service network and the Management network. In the Cluster Hardware Control Network figure, the black network lines represent the Service network and the red lines represent the Management network.

Figure 12. Cluster Hardware Control Network

The Service network is a private, out-of-band network dedicated to managing the cluster's hardware. This network provides Ethernet-based connectivity between the CPC drawers' Flexible Service Processors (FSPs), the rack control Bulk Power Assemblies (BPAs), the Executive Management Server (EMS), and the associated Hardware Management Consoles (HMCs). Two identical network switches (ENET A and ENET B in the figure) are deployed to ensure high availability of these networks.

The Management network is primarily used to boot all nodes (the designated service nodes, compute nodes, and I/O nodes) and to monitor their OS image loads. This Management network connects the dual EMSs running the system management software with the various nodes of the cluster. For security reasons, both the Service and Management networks are to be considered private and should not be routed into the enterprise public network.

A high-level description of the main flow of data on the cluster service network and cluster management operations is as follows:
- After discovery of the hardware components, xcat stores their definitions in the cluster database on the Executive Management Server. HMCs, CPC drawers, and racks are discovered through Service Location Protocol (SLP) by xcat. The discovery information includes model and serial numbers, IP addresses, and other details. The Ethernet switch of the service LAN is also queried to determine which switch port each component is connected to. If required, the discovery can be run again while the system is up.
- The HFI/ISR cabling can be tested by the CNM daemon, which also runs on the EMS.
- The disk enclosures and their disks are discovered by GPFS services on those dedicated nodes when they are restarted.

46 The hardware is configured and managed through the service LAN that connects the EMS to the HMCs, BPAs, and FSPs Management is hierarchical with the EMS at the top and then the service nodes that manage all of the nodes in their building blocks. Management operations from the EMS to the nodes are distributed out through the service nodes. Compute nodes are deployed using a diskless image server on the service node. For information on diskless logging see Diskless Nodes, on page 44. If a utility node is used in lieu of a service node, it must have access to the cluster database, which will drive the need for the node to be diskful and have ethernet connectivity to the management network. Monitoring information comes from its sources (Racks / CPC drawers, nodes, HFI/ISR fabric, and others.) and flows through the service LAN and cluster LAN back to the EMS. This monitoring information is logged in the xcat database Systems management This topic provides information about the systems management. The xcat software and the DB2 database software are installed on the EMS, and some open source tools that xcat uses: conserver, atftp, and openslp. The IBM Resource Monitoring and Control (RMC) software is installed on the EMS for monitoring purpose. During cluster set up, xcat automatically deploys its service nodes, installs the necessary xcat, Resource Monitoring and Control, DB2 client, and open source tools on the service nodes. Some Resource Monitoring and Control software is automatically installed on the compute nodes as part of the node deployment Network management Integrated Switch Network Management (ISNM) software is a package that contains the Central Network Manager (CNM), and Network Management commands. This package is installed on the EMS. LNMC, a second piece of Network Management support, is a firmware component packaged with the Global Firmware. And LNMC is installed on the Flexible service processor (FSP) of each POWER7 drawer. Network Management can be configured to collect ISR performance counters and store them in the cluster database. This performance counter collection occurs at configurable intervals. Network Management depends on certain cluster configuration information provided by the cluster system administrator during installation and stored in the xcat cluster database. The items that Network Management depends on are Rack IDs, SuperNode and drawer IDs, and an indicator of the topology of the data network Hardware management The HPC hardware server provides a communication channel to the hardware entities on the Service or the Cluster hardware control network. This software must be configured with the IP addresses of the Flexible service processors (FSPs) and the rack control Bulk Power Assemblies (BPAs) Event management This topic provides information about IBM Toolkit for Event Analysis and Logging (TEAL) in the cluster environment. p. 46 of 135

47 TEAL provides a central location for logging and analysis of low-level system events. It provides a means for different low-level sub systems to add their events to the common event log, analyze those events, and control how problem notifications are delivered. TEAL provides event consolidation within the cluster for: ISNM is the primary event reporting path for the network management Service Focal Point consolidation hardware events reported by the set of HMCs connected to the cluster GPFS capture events that occur within the GPFS cluster using thegpfs collector node, which is typically a service node. However, if you are configuring highly available service nodes, see Highly Available Service Nodes and LoadLeveler and TEAL GPFS, on page 51. LoadLeveler a polled connection to LoadLeveler PNSD (POE) monitor network retry errors that occur within the compute nodes The administrator would be able to view alerts and manage them by: 1. Viewing the alerts and events associated with the alerts through command-line tool 2. Set up various reporting mechanisms for alerts through Listening for RMC events Via Simple file/stdout 3. Close completed alerts (automatically through Repair and Verify on the HMC) and through command line for alerts that have been reported through other mechanisms Software configuration Software configuration describes.the software available for clusters with 9125-F2C systems. The list of software available for use on the system is as follows: Primary software on the cluster: Operating System The OS is AIX 7.1B or RHEL 6 with a customized kernel supplied by IBM. The chosen OS runs on each compute node. Tools and Libraries ESSL, Parallel ESSL, HPCS Toolkit, HPC Toolkit as part of IBM Parallel Environment Languages - C, C++, Fortran, Java, UPC Parallel Programming Environments - IBM Parallel Environment (MPI, SHMEM, LAPI/PAMI) Parallel Programming Languages OpenMP, UPC, X10 Network Management Integrated Switch Network Management (ISNM) (formerly know as CNM) GFW Global Firmware level that includes Local Network Management Controller (LNMC) PHYP Power Hypervisor Debuggers IBM Parallel Debugger File systems (local, shared, network) GPFS, NFS, and ext3 System Administration tools xcat, Toolkit for Event Analysis and Logging (TEAL) Clusters Database DB2 Job Scheduling and Resource Management LoadLeveler IBM High Performance Computing Hardware Server (HWS) p. 47 of 135

48 Software stack The following illustration provides a pictorial representation of the Software stack. Figure 13. Software stack Subsystem configuration This topic provides information about the different software subsystem configurations. LoadLeveler configuration: This topic provides information about the LoadLeveler configuration in the cluster environment. Here are the LoadLeveler daemons that you need to place: Central manager and a backup central manager Resource manager and a backup resource manager N number of schedd's, when N depends on the rate of jobs typically submitted All of these LoadLeveler daemons can use the cluster database for configuration information or can use traditional configuration files. While use of the cluster database is recommended, to decide whether you are p. 48 of 135

49 going to use the cluster database for LoadLeveler or not, see the following list of pros and cons: Pros : Automation of setup and configuration changes in one central location LoadLeveler Event information can be made available from the data base via TEAL Rolling update of the cluster software is supported The llconfig command can be used to change/view configuration data Cons: Additional step in installation and set up of LoadLeveler The LoadLeverler daemons listed above must be run on nodes that have access to the database. Currently, this means that the nodes must be diskful and be connected directly to the management ethernet LAN. (All of these LoadLeveler daemons also need to be directly on the HFI network.) There is an additional layer of software to be considered when debugging The schedd's also need to be on nodes that are part of the GPFS application data cluster (or some other global file system) to support the movespool function and checkpoint/restart. If the database configuration option is used, the DB2 database space is shared with xcat. The LoadLeveler scheduler component is installed on the management node to set up the configuration. On the service nodes, both the LoadLeveler scheduler and Resource Manager (resmgr) components are installed and the LoadLeveler manager daemons are run. However, if you are configuring highly available service nodes, see Highly Available Service Nodes and LoadLeveler and TEAL GPFS, on page 51. On compute nodes, the LoadLeveler resmgr component is installed. LoadLeveler jobs can be configured to run on every compute node of the cluster within the limits set forth by IBM Power 775 Availability Plus (see Power 775 Availability Plus Overview, on page 52. A service node must be specified where the Central Manager (Negotiator daemon) will run and at least one other service node as its backup. The Resource Manager list is configured to match the Central Management list, because the Resource Manager daemon reports resource status to the Central Manager. However, if you are configuring highly available service nodes, see Highly Available Service Nodes and LoadLeveler and TEAL GPFS, on page 51. At least two service nodes will be specified to be the Schedd daemons or Job Managers. The schedd keeps a copy of the job data on spool or in the database. If you use a cluster configuration file, you can specify all of the schedds for the cluster using the keyword SCHEDD_LIST. However, if you are configuring highly available service nodes, see Highly Available Service Nodes and LoadLeveler and TEAL GPFS, on page 51. LoadLeveler directories locations are specified. On diskless nodes, they are placed in a location that will not be lost on a reboot. This is suggested to be in the GPFS filesystem on compute nodes and local storage on service nodes. For information on diskless logging see Diskless Nodes, on page 44. The LoadLeveler Startd daemon runs on each of the compute nodes. The compute nodes are divided into multiple regions, each of which can be made up of a subset of SuperNodes. There is one LoadLeveler Region daemon running in each region. For example, one can set up each LoadLeveler region with two building blocks. The Region daemon is running on a Service/Utility node of a building block, with the backup Region daemon set up to run on a Service/Utility node of the other building block in the same region. 
The Region daemon maintains the system and adapter status for all the nodes in its managed region, and sends heartbeat status changes to the Resource Manager.
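As an illustration of this placement, the following is a minimal sketch of the corresponding configuration keywords (the same keywords are described in the planning chapter under LoadLeveler). The node names are hypothetical examples, and the exact syntax should be verified against the LoadLeveler documentation for your release:

CENTRAL_MANAGER_LIST = servicenode01 servicenode02
RESOURCE_MGR_LIST = servicenode01 servicenode02
SCHEDD_RUNS_HERE = TRUE          # set on at least two service nodes
MAX_STARTERS = 32                # one task per physical core on a compute node
CLASS = batch(32) development(8) # initiators per job class, defined to match local policy

The region layout (the REGION stanza with REGION_MGR_LIST) is defined in the administration data, with the primary and backup Region manager daemons placed on Service/Utility nodes in different building blocks of the same region, as described above.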

50 The following figure provides a pictorial representation of the LoadLeveler configuration. Figure 14. LoadLeveler Configuration For running jobs, the MAX_STARTERS configuration parameter is set up to permit as many tasks as there are physical cores to run on each of the compute nodes. A number of initiators for various job classes are defined using the CLASS configuration parameter. This LoadLeveler configuration permits running of applications across the system. On the EMS,, you can copy the sample configuration files LoadL_config.l and LoadL_admin.l, update the values in them, and use them to initialize the database with the LoadLeveler configuration with the command llconfig -i -f <your_config_file>. Specify only the configuration file and the value in it for ADMIN_FILE will be used for the administration file. If you have an xcat cluster configuration file, you can add that to the command with the option -t <cluster_config_file>. GPFS configuration: This topic provides information about the GPFS configuration in the cluster environment. GPFS IO nodes use released versions of LSI device-drivers for the two-ported adapter with the appropriate functions, to be used on the system. All the nodes use the ml0 driver to be able to use the HFI effectively. GPFS depends on TEAL and OS analytics to handle service and replacement of drives. GPFS 3.4 is the release of code used since that is the release, which includes support for GPFS RAID p. 50 of 135

51 Functionality and support for scaling. GPFS is to be installed on all the OSIs. The nodes physically attached to the storage enclosures are dedicated as NSD servers. Configure all storage from the disk enclosures into a single file system to demonstrate the aggregate bandwidth. The OSIs described as IO nodes in the hardware section becomes GPFS NSD servers. Three per building block, where each NSD server is a full octant with 16 dual ported adapters having full connectivity with multiple paths to two disk enclosures. A subset of these NSD servers is designated as GPFS quorum nodes. The GPFS configuration also includes other functions such as token manager, quota manager, and session nodes for HSM, collector node for monitoring. Some of these functions are located on the NSD servers while others to be co-located with other function on the utility nodes. Highly Available Service Nodes and LoadLeveler and TEAL GPFS Service nodes that are not configured to be highly available can be a logical place to run the LoadLeveler and tlgpfsmon daemons, because they are diskful, connected to the management ethernet LAN, and have access to the cluster database. But, if you are running highly available service nodes, the service nodes must be in a GPFS cluster that is separate from the application data GPFS cluster. If this is your case, you can not run the LoadLeveler and tlgpfsmon daemons on the service nodes, because the service nodes can't be in 2 GPFS clusters at the same time. Instead, you will need to carve out some "utility" nodes, likely from the same octants as some of the service nodes. For the utility nodes that are running the LoadLeveler daemons, if LoadLeveler is using plain configuration files, then those utility nodes don't need to have disks and ethernet adapters. They only need to be part of the GPFS application data cluster and be connected to the HFI network. For the utility nodes that are running tlgpfsmon, they must be GPFS monitoring collector nodes and be connected to the cluster database (and therefore have disks and ethernet adapters) Barrier-synchronization register The barrier-synchronization register (BSR) is a memory register that is located on certain POWER technologybased processors. You can write a parallel-processing application so that the application uses a BSR to perform barrier synchronization as a method for synchronizing the threads in the parallel-processing application. You can divide BSRs into arrays and assign BSR arrays to partition profiles. Each BSR array is 8 bytes long. The number of BSR arrays that are available on a managed system depends on the type of processors used on the server model. You can see the number of BSR arrays that are available on a managed system by viewing managed system properties on the HMC. The administrator will assign privileges to users to be able to use the BSR capability. More information on configuring use of BSR available in Barrier Sync Register (BSR) Configuration, on page 116. To take advantage of barrier synchronization, a parallel-processing application must be written specifically to access and write to the BSR or a BSR array. p. 51 of 135
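A brief sketch of checking the BSR resources from the HMC command line follows. This assumes the partition profile attribute is named bsr_arrays, which should be verified against the HMC level in use; the managed system name is a placeholder:

lssyscfg -r prof -m <managed_system> -F lpar_name,name,bsr_arrays

The same values can also be viewed and changed through the managed system and partition profile properties panels on the HMC graphical interface.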

52 2.2.6 Power 775 Availability Plus Overview Power 775 Availability Plus describes.the Power 775 Availability Plus (A+) policy for clusters with 9125-F2C systems. The following topics will be introduced: Brief introduction of the A+ policy Improvements gained using the A+ policy Areas of the A+ policy that require extra awareness A+ policy terminology Components that are affected by the A+ policy Spare policy decisions and their impacts to the A+ policy Responsibilities in implementing the A+ policy The 9125-F2C system is architected to rely on a Warranty and Service Policy referred to as Power 775 Availability Plus (A+). The principle behind A+ is, for key components, or resources, to provide spares at the time of deployment of a cluster rather obtaining replacement parts for each instance of failure. As resources fail, spare will be consumed and the failed resources will not be repaired until absolutely necessary. The intention is to provide enough spares at time of deployment to never have to repair these key components. These components are discussed below. In most cases, the A+ management policies take place once system hardware has been declared to be installed. This should be clearly stated and understood during the planning of the cluster installation so that all parties understand the ground rules. The A+ policy improves system availability by: Improving the mean time to restoration of compute resources compared to the standard methodology. Eliminating the need to wait for replacement parts in order to restore compute resources Use of A+ hardware (described below) reduces the risk of early life failure by using parts that are already in the cluster and have experienced a similar burn-in period as the other hardware in the cluster. The extent of this depends on the spare policy chosen for a particular cluster. Sparing policies are explained below Avoiding extended downtime that can be caused by complex and time consuming repair actions. Avoiding downtime of otherwise available components that share the same package as the one being repaired. For example, a single QCM repair would require bringing down the other 7 in a drawer. Avoiding the potential for inadvertently damaging other components while repairing a failed component. Depending on the spare policy chosen for a particular cluster, the spare resource may be used to temporarily increase the capacity and performance that a cluster could otherwise achieve if only the minimum number of resources were supplied at time of deployment. Examples include: Using the extra resource for additional production capacity. Using the extra resources for non-production capacity, such as testing patches, or developing applications. Note: It must be recognized that additional capacity is not guaranteed for the life of a cluster. In most cases, p. 52 of 135

53 this additional capacity will dwindle over time. Therefore, expectations should be set with this in mind. At no time should production work rely on the availability of additional capacity. Some areas of the A+ policy that require extra awareness by the customer are: Additional space may be required by the spare resources. This depends if there is spare space in the frames that can accommodate the spare resources. Additional power consumption will be required by the spare resources. This may be mitigated by the choice spare policy as described below. Additional cooling capacity may be required by the spare resources. This may be mitigated by the choice spare policy as described below. If spare resources are used on a regular basis, the perceived total cluster performance and workload capacity may decrease over time as components Power 775 Availability Plus and are not repaired. However, the minimum guaranteed performance and workload capacity will be maintained. The loss will be in excess capacity. When the initial service contract ends, the A+ policy must be reassessed to determine if repairs or more resources are required to extend the service contract. In order to maintain the minimum expected performance in a cluster, a minimum number of Compute, IO, Service and Storage components is determined. These are defined as Workload Resources. In order to maintain the number of Workload Resources, which maintains the minimum expected performace, additional hardware is supplied to account for hardware failures in the cluster. This spare hardware is defined as A+ Spare Resource. As will be explained below, the sparing method that is chosen will determine how A+ Spare Resources and Workload Resources relate to each other. There are several components in the cluster that are defined as A+ Resources : Quad-Chip Modules that contain the processors HFI Hub Modules, which include switching function and the optical modules Individual HFI links, which include switching ports, optical modules, optical module ports and cables. Disks in the 9125-F2C attached disk enclosure When a particular A+ Resource fails, the typical fence or gard operations are performed by firmware to isolate the failing component from the system. Then, the A+ policy determines if there is enough of this sort of resource in the cluster to maintain the Workload Resource level. If there are not enough A+ Spare Resources, then repair actions will be scheduled to repair enough of the failed A+ Resources to maintain the Workload Resource level to the end of service contract. The A+ policy uses expected failure rates to determine the A+ Depletion rate over the life of a cluster. The expected depletion rate determines how many A+ Spare Resources are required to maintain the required Workload Resources from the time of shipment through to the end of the service contract. This expected failure rate also determines the Refresh Threshold over time. The Refresh Threshold is the minimum number of a particular resource that must be available to assure that the required Workload Resources are maintained. As long as the A+ resources perform within the expected failure rate, no A+ resources should ever require repair through the end of the service contract. p. 53 of 135

54 Once the Refresh Threshold is crossed, repair actions are scheduled to bring the cluster above the Refresh Threshold. The target number of repairs to be made is known as the A+ Reset. The A+ Reset is determined by the age of the cluster when the repairs are being made and the expected failure rates of the A+ resources. The A+ Reset takes into consideration the phenomenon of early life failure and the tendency for part failure rates to stabilize over time. Therefore, early in the life of a cluster, the A+ Reset will bring you much closer to the original number of deployed resources than toward the end of the life of a cluster. The following figure illustrates the various resource levels, refresh threshold and A+ reset. The top curve is the Depletion rate of Total Cluster A+ Resources. This is based on the classic failure rate bathtub curve. In this case, the available resources are depleted quickly by early life failures and then the rate levels off until end of life failures start to appear, and the depletion rate increases again. Throughout most, if not all of the life of the cluster, the Refresh Threshold is above the Workload Resources level. The A+ Reset level is above the Refresh Threshold to allow for some buffer if repairs must be made. Without the A+ Reset being above the Reset Threshold, once you cross the Refresh Threshold, it will be likely to continue to cross it after every subsequent failure. Bear in mind that the curves in the figure are for illustration purposes. In practice, the definitions of the Refresh Threshold and A+ Reset will be more like step functions with intervals like every 6 months rather than the smooth curves in the figure. Also note that the Depletion Rate curve for the Total Cluster Resources and the Refresh Threshold are illustrated such that they never cross. Recall that the Total Cluster Resources are planned such that at the expected failure rates, the cluster will never require repair, which implies that the Refresh Threshold is never crossed. The A+ (A Plus) Reset is used if the Refresh Threshold is exceeded and repair is necessary. The A+ Reset is customized to a value that is intended to provide enough spare resource to prevent hitting the Refresh threshold in the future. p. 54 of 135

55 Figure 1: A+ depletion rate, refresh threshold and reset The A+ policy is customized for each cluster and is determined during the planning phase of a cluster. At that time the Workload Resources are defined. Then, the Total Cluster Resources are defined by the projected failure rates. Then, the Refresh Threshold and A+ Reset curves are defined to protect the Workload resource target over the duration of the service contract. It is important to note that there are components in the cluster that are not A+ Resources. The repair policies for these components are more traditional. Example non-a+ resources are power supplies, memory and adapter cards. You will note that a number of these non-a+ resources have redundant components that allow for deferred maintenance Impacts of node resource failures The A+ policy is designed such that a A+ failure in any kind of node will be recovered in such a way that the impact will be to the number of compute resources. This greatly simplifies the accounting and impact of failures. For example, if an IO node fails, it will be redeployed where a Compute node was originally deployed, and the Compute node will be redeployed in the failing node position. In other words, the IO node and Compute node functions are swapped between physical hardware. The same is true for Service nodes and any sort of Utility node or any node that is not a Compute node. p. 55 of 135

This sort of policy is possible because the cluster will be designed with spare resources to absorb non-compute node failures. If at any time the spare resources are depleted to a degree where there is no longer N+1 redundancy, the A+ policy will be to call for a repair action.

For example, IO nodes can be deployed as one octant in a server drawer with 7 Compute nodes and all of the required adapter hardware and cables. Furthermore, to protect against adapter slot failures, a second, backup drawer can be deployed above or below the primary drawer. That backup drawer can be populated with Compute nodes. If an IO node experiences a failure in a processor, it can be swapped with up to 7 Compute nodes in the same drawer before it needs to be redeployed outside of that drawer. In that case, the adapter cards and cables will also have to be redeployed. If an adapter slot fails and there are no spares in the drawer, or there is no more Compute resource within the drawer to be swapped with the IO node, then the IO node, adapter cards, and cables can be redeployed to the backup drawer. If there is an adapter card failure, the adapter card is serviced in the typical manner and does not fall within the domain of the A+ policy.

Consider the amount of redundancy built into this policy:
IO node failure:
o N+7 redundancy before redeploying to another drawer
o N+15 redundancy if you consider the ability to redeploy to another drawer
Adapter slot failure:
o N+(number of spares in drawer) before redeploying to another drawer
o N+1 if all adapter slots are deployed
Adapter failures generally have at least N+1 redundancy, because there is typically a primary and backup node configuration deployed where the backup node can absorb the function of the failed adapter until the repair is made.

Spare Policy

There are three choices for the spare policy: Hot spares, Cold spares, and Warm spares. Hot spares are kept with the partition up and running at all times, while cold spares are kept powered off.

Hot spares provide the capability to maintain the Workload Resources in a more seamless fashion with the

57 least amount of administrator action required. This is most advantageous when a compute resource fails. Hot spares also provide burn-in time for the spare resources, which reduces the probability of early life failure when spare resources must be used to replace the base workload resources. In the case of an IO node, Service node, or Utility node failures, the administrator may have to play a more active role in recovery to maintain expected performance levels. This may also be the case for certain HFI Hub module failures. While Compute node failures are handled by the typical job management policies which detect compute node resource problems, the customer may choose to implement policies where failed Compute nodes are placed into different job pools containing nodes that are running in a degraded mode and are to be used for particular types of workload, such as early application development, or jobs that do not have stringent performance requirements. Cold spares do not consume power and cooling energy because they are powered-off. Also, because they are not in use, one does not learn to have increased expectations above the minimum required performance and capacity of a cluster only to have that excess performance and capacity degrade as the number of spare resources are depleted. Meanwhile, the cold spares also place more responsibility on the administrator to bring them online in response to a failure. They also have more impact when they are deployed, because in order to bring a cold spare node online, you must re-ipl its drawer, which will impact any operational nodes in that drawer. Because of the impact of Cold spares to the other resources in a drawer, it is recommended to not deploy cold spares in drawers that contain anything other than Compute nodes. If this is not possible, careful consideration must be made with respect to how the other nodes will fail over (at least temporarily) during the reipl to minimize the negative impacts to the cluster capacity and performance. When attempting to save resources with Cold spares consider that while you can power off particular QCMs that contain the spare nodes, you cannot power off the hub modules in the associated octant. Warm spares consume less power and cooling energy because they are booted to partition standby and therefore are not used by applications. They come to full partition boot much more quickly than spare nodes that are powered off. They also do not require a reipl of their drawer to deploy them Job management policies with A+ Most job management policies do not need to be impacted by A+. The default behavior of avoiding Compute nodes with degraded resources can be good enough for A+. For example, when a processor fails, if a job requires all of the processors to be functional, the job management software will be aware of the failed processor and avoid that node. p. 57 of 135

58 The one attribute of job management that must consider A+ is the maximum number of nodes to be used in a job. This cannot exceed the required Workload Resources as documented when purchasing the cluster. If this policy is not enforced, it will be possible to experience a failure in the cluster that will prevent the largest jobs from starting. For example, if the Workload Resources were defined to be 30 nodes and 32 nodes were deployed with 30 for Workload Resources and 2 for A+ Spare Resources, and the maximum number of nodes in a job was defined to be 32, if any node had a A+ resource failure, jobs requesting greater than 31 nodes would fail to start. For this reason, the maximum job size should be set to 30 nodes in this example cluster. Local policies on how to leverage spare resources may require more sophisticated job management. Two typical considerations are (1) creating a pool of the spare resources in the cluster (2) creating a pool to which failed resources are added. In creating a pool of the spare resources in the cluster, you are segregating certain nodes from the general pool of resources and designating them as spare resources. This can control expectations on capacity and performance of the cluster. While you have set the maximum job size to remain at the Workload Resources level and no single user may experience greater capacity and performance on a given job, the aggregate capacity and performance that a user may experience over running multiple jobs simultaneously may lead him to believe that this is to be expected over the lifetime of the cluster. This may also be the perception by those tracking aggregate performance and job throughput in a cluster. If the spare resources are maintained and accounted separately, this could mitigate the perceived impact as spare resources are lost over time. In creating a pool to which failed resources are added, you can more clearly manage which resources have been impacted by A+ failures. This is more likely to be something that you would do in conjunction with creating a pool of spare resources. In which case, you would have the normal production pool of resources, a pool of spare resources and finally a pool of failed or degraded resources. If other job management policies are created, the implication on recovery procedures must be accounted for and documented locally. To do this, start with the Availability Plus recovery procedure documented in High perfomance clustering using the 9125-F2C Service Guide Actions in response to A+ Resource failures There are several actions that must be taken once a A+ Resource fails. These are: Discover the failure Recover from the failure Determine if this is A+ resource failure Report the failure Gather data required to determine if repair actions are required Perform repairs as necessary More details on each action are provided in the Service section. See also the following A+ Responsibilities section p. 58 of 135

59 to understand which organization is responsible for particular actions A+ Responsibilities A+ requires actions to be performed by various parties: The administrator s responsibilities are: Discover a A+ failure o The local policies for how to monitor hardware failures in the cluster will dictate whether or not this is found by polling for failures, or there is automated notification. Recover from A+ failure o This may involve running one or two scripts to move around resources o If local policy dictates, it may involve moving a particular node into a different job pool o It may involve contacting IBM Service to move adapter and cable resources Report the A+ failure to IBM Service o This will involve opening a PMH, because A+ failures are not called home to IBM. Gather and forward data to IBM o Gathering data involves running a script on the EMS o Forwarding data is typically involves ftp ing data to o Gathered data files are typically small and should not impact the EMS filesystem space IBM s responsibilities are: Determine if a repair action is required to maintain required Workload Resources level If necessary dispatch personnel to perform repairs Disk Failures While disk failures fall into the category of Power 775 Availability Plus, they are managed in a different manner. Recovery is done in the software RAID function in GPFS. Once the number of failures reaches the Refresh Threshold, GPFS reports the problem as a serviceable event and IBM service can then be dispatched to repair the failures A+ Examples In order to illustrate an example of the A+ policy in action, consider the following: A processor in a compute node fails o The failed processor will be garded out by the firmware so that it is isolated from the rest of the system. The administrator is notified by automated event forwarding to his (assuming this has been configured) The administrator checks the FRU list and determines that a A+ Resource has failed p. 59 of 135

60 The local spare policy is for Hot spares. So, no action is required in identifying the spare Compute resource. o If the policy is for Warm spares, the spare Compute node must be identified and booted to the partition (out of partition standby). o If the policy is for Cold spares, the spare Compute node must be identified and its drawer must be reipld. Therefore, all of the nodes in that drawer must first be drained of any work. The local job management policy is to not manage A+ Spare Resources and failed resources separately so the administrator takes no recovery action. Instead, he allows the job management to recognize when the node with the failed resource does not have enough resources for a job, and thus avoid using the node. The administrator contacts IBM and opens a PMH IBM request data to be gathered The administrator gathers the data by running a script on the EMS The administrator uses ftp to transfer the data to an IBM ftp server IBM examines the data If the Refresh Threshold has not been met, IBM simply records the incident for tracking purposes and takes no further action. If the Refresh Threshold has been met: o IBM examines the data to determine how many repair actions must take place to return the cluster to the required A+ Reset level at this time. The data will include a list of previous failures and IBM will have a policy to prioritize the repairs. o In the PMH, IBM will document the actions to be taken o IBM will dispatch the appropriate personnel for the repairs. o In some cases, diagnostic procedures will be run to isolate to the failing component As another example, consider the following: A processor in an IO node fails o If this is a primary IO node, the backup IO node should take over until the node with the failure recovers. o The processor will be garded out by the firmware. The administrator is notified by automated event forwarding to his The administrator checks the FRU list and determines that a A+ Resource has failed Because an IO node must have all of its resources available at all times to maintain performance, the IO node must be redeployed to another node in the cluster. The administrator will run a command on the EMS that will swap the IO node with a Compute node in the same drawer. o The impact will be such that a Compute node resource is lost instead of an IO node resource. o If there are no available nodes in the current drawer, there are contingency plans in the service procedures. The administrator contacts IBM and opens a PMH IBM request data to be gathered The administrator gathers the data by running a script on the EMS. o While the IO node resource is not impacted, the Compute node resource is impacted; therefore, the number of failures must be examined against the Refresh Threshold. The administrator uses ftp to transfer the data to an IBM ftp server p. 60 of 135

61 IBM examines the data If the Refresh Threshold has not been met, IBM simply records the incident for tracking purposes and takes no further action. If the Refresh Threshold has been met: o IBM examines the data to determine how many repair actions must take place to return the cluster to the required A+ Reset level at this time. The data will include a list of previous failures and IBM will have a policy to prioritize the repairs. o In the PMH, IBM will document the actions to be taken o IBM will dispatch the appropriate personnel for the repairs. o In some cases, diagnostic procedures will be run to isolate to the failing component Note: The repair actions at this point will be done on Compute resources A+ and Installation The A+ policies begin as soon as the 9125-F2C systems are shipped. The planning for A+ resources and the refresh thresholds take into account the possibility of failure from the moment of shipment through installation and until the end of the service contract. p. 61 of 135

2.3 Cluster planning

Cluster planning describes the planning topics for clusters with 9125-F2C systems.

Planning Overview

Planning a cluster using 9125-F2C systems is a complex activity. It requires many different skills and a variety of knowledge. All planning should be done in conjunction with the IBM POWER HPC Large System Review Board (LSRB). Much of the information contained within the planning chapter will be covered with that team and its designated sub-teams. Reading the planning chapter will help with understanding the key points brought up in planning sessions.

Before proceeding, be sure to review Clustering systems by using 9125-F2C in this document.

Note: Where to find referenced documentation is available in Cluster information resources.

Supported devices and software

Supported devices and software describes the supported hardware and software for clusters with 9125-F2C systems.

Note: The support statements below are for minimum support levels. The most recent support statements are available on the Cluster Readme website for p775; see General cluster information resources.

The following software is supported in the initial release. For the latest level, see the Cluster Readme website for p775 and choose the appropriate service pack.

Function                    Software supported        Level(s)
Operating System            AIX 7.1                   AIX 7.1 TL0 SP
Parallel Environment        PE Runtime Edition
Storage application         GPFS
Job management              LoadLeveler
Scientific Library          ESSL/PESSL
Systems Management          xCAT
HFI Network Management      ISNM
Event Analysis              TEAL

The following hardware is supported:

Function                    Hardware supported
Compute nodes               9125-F2C
GPFS nodes                  9125-F2C
Service nodes               9125-F2C
Login Nodes                 9125-F2C
Disk Enclosure              9125-F2C disk enclosure feature
HMC                         Defined by HMC planning
Management Node

Planning the compute subsystem

Planning the compute subsystem describes the topics for planning the compute subsystem for clusters with 9125-F2C systems. Before proceeding, be sure to review Clustering systems by using 9125-F2C in this document.

Operating system image

The choice of operating system for compute systems will dictate the operating system used for the rest of the systems in the cluster, including management systems. The exception is the HMC, which uses its own proprietary operating system.

Service Node to Compute Node Relationship

The service nodes serve the operating system to the diskless compute nodes. There is a primary and backup service node for groups of service nodes.

LoadLeveler

Besides the typical LoadLeveler planning that is covered in the LoadLeveler documentation, IBM Power 775 Availability Plus affects the planning of LoadLeveler; see Planning Power 775 Availability Plus Design, on page 73.

Install the Resource Manager (resmgr) component on all the compute nodes. By default, every compute node will run a Startd daemon. Some site-specific configuration decisions are made and specified using these configuration parameters:

CENTRAL_MANAGER_LIST
Specify a service node where the Central Manager (Negotiator daemon) will run and at least one other service node as its backup.

RESOURCE_MGR_LIST
This will be the same as the CENTRAL_MANAGER_LIST, unless you set different values. The Resource Manager daemon reports resource status to the Central Manager.

SCHEDD_RUNS_HERE
Specify at least two service nodes to be the Schedd daemons or Job Managers. The schedd keeps a copy of the job data on spool or in the database. If you use a cluster configuration file, you can specify all of the schedds for the cluster using the keyword SCHEDD_LIST.

64 LOADL_ADMIN List the users who will be LoadLeveler administrators. LOG SPOOL EXECUTE Specify locations for the LoadLeveler directories. On diskless nodes, place them in a location that will not be lost on a reboot. The preferred location is in the GPFS filesystem on compute nodes and local storage on service nodes. REGION stanza A LoadLeveler Startd daemon runs on each of the compute nodes. The compute nodes are divided into multiple regions, each of which can be made up of a subset of SuperNodes. There is one LoadLeveler Region daemon running in each region. For example, you can set up each LoadLeveler region with two building blocks. Using the REGION_MGR_LIST keyword in the stanza, the Region manager daemon is specified running on a Service/Utility node of a building block, with the backup Region manager daemon set up to run on a Service/Utility node of the other building block in the same region. The Region manager daemon maintains the system and adapter status for all the nodes in its managed region, and sends heartbeat status changes to the Resource Manager. MAX_STARTERS CLASS For running jobs, MAX_STARTERS configuration parameter is set up to permit as many tasks as there are physical cores to run on each of the compute nodes. A number of initiators for various job classes are defined using the CLASS configuration parameter. This LoadLeveler configuration permits running of applications across the system. On the EMS, you can copy the sample configuration files LoadL_config.l and LoadL_admin.l, update the values in them, and use them to initialize the database with the LoadLeveler configuration with the command llconfig -i -f <your_config_file>. Specify only the configuration file and the value in it for ADMIN_FILE will be used for the administration file. If you have an xcat cluster configuration file, you can add that to the command with the option -t <cluster_config_file> Performance Huge page use: Huge pages can be desirable for performance. However, one must take into account the amount of memory available before taking advantage of huge pages. This is particular true in a partitioned octant. If two LPARs are configured per octant, and only 512GB of memory per system and 4 huge pages are requested for each LPAR, the second LPAR will not be able to start, because there will be no memory left after 2 huge pages have been allocated. p. 64 of 135
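The LoadLeveler keywords above are typically loaded into the cluster database from the EMS with the llconfig command described earlier. The following is a minimal sketch; the copied file names and the xCAT cluster configuration file name are examples only:

cp LoadL_config.l my_LoadL_config.l      (then edit the keyword values described above)
cp LoadL_admin.l my_LoadL_admin.l        (referenced by ADMIN_FILE in the configuration file)
llconfig -i -f my_LoadL_config.l -t my_cluster_config_file

The -t option is used only if an xCAT cluster configuration file is available; otherwise omit it. Only the configuration file is specified on the command; the administration file named by its ADMIN_FILE value is used automatically.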

AIX system tunables:

This topic provides information about AIX system tunables. In planning AIX system tunables, the following should be considered. Use the following AIX system tunables planning worksheet to track the setting of these tunables. It should be kept up to date should testing lead to a decision to use different settings. Details on the commands are available in the AIX documentation.

Table 7. AIX System Tunables Worksheet (record the chosen value for each tunable)

Tunable: To support larger jobs, configure maxuproc (maxuproc=)
Command for recommended setting: chdev -l sys0 -a maxuproc=512

Tunable: AIX run-queue co-scheduler tuning (shed_primrunq_mload=)
Command for recommended setting: schedo -p -o shed_primrunq_mload=0

Tunable: Processor folding should be disabled, because it can degrade performance (vpm_fold_policy=)
Command for recommended setting: schedo -r -o vpm_fold_policy=0

Tunable: To allow shared memory segments to be pinned (v_pinshm=)
Command for recommended setting: vmo -r -o v_pinshm=1

Tunable: Support users' request for large pages, if applicable (lgpg_size=, lgpg_regions=)
Command for recommended setting: vmo -r -o lgpg_size=<size> -o lgpg_regions=<x>

Tunable: Disable enhanced memory affinity to quiet down the VMM daemon (enhanced_memory_affinity=)
Command for recommended setting: vmo -r -o enhanced_memory_affinity=0

Tunable: Enable aggressive hardware prefetch, which can improve network bandwidth
Command for recommended setting: dscrctl -s 0x1e -b

Tunable: Choose SMT2, which is the preferred mode for HPC
Command for recommended setting: smtctl -t 2 -w boot

Planning Compute Hardware Placement

To plan for the placement of compute hardware, refer to the planning guide for the 9125-F2C server.

Planning Disk Storage Subsystem

Planning Storage Subsystem describes the topics for planning the storage subsystem for clusters with 9125-F2C systems. Before proceeding, be sure to review Clustering systems by using 9125-F2C in this document.

This is just a brief checklist of the things that should be considered while planning the disk storage subsystem:
Determine the total amount of disk space required. This will dictate how many disks are required.

66 The number of disks required will dictate the number of disk enclosures required. Details of the disk storage subsystem should be worked out with IBM during the planning stage Planning Management Subsystem Overview Before proceeding, be sure to review Clustering systems by using 9125-F2C in this document Executive Management Server The Executive Management Server (EMS) is the central point for the system administrator to manage and monitor the cluster resources. The EMS cannot be managed by an HMC that is managing p775 servers. Otherwise, TEAL will not be able to collect serviceable events from the HMC. This is critical for Availability-Plus management and the administrators ability to monitor hardware events from the EMS. See also, xcat Planning Review Cluster Installation, on page 75 to determine how you are going to set up xcat. One key activity that will drive the xcat setup is the naming convention that you will choose for the nodes. If it is supported by xcatsetup (see xcat setup ( you may use xcatsetup to ease installation. Much more information on xcat is available at title=xcat_documentation. ISNM Planning ISNM provides the ability to manage and monitor the network. The Central Network Manager (CNM) component is installed on the EMS. There is also a component in ISNM which is part of the GFW installed on each server. When planning for installation, it is important to plan for the following: Topology see Planning HFI network, on page 70. Performance counter collection see Performance monitoring planning, on page 68. Event monitoring is surfaced through TEAL see TEAL monitoring planning, on page 68 TEAL Planning The Toolkit for Event Analysis and Logging (TEAL) is installed on the EMS. The installation instructions guide you through enabling all of the packages to provide monitoring for: HFI Network events from ISNM Hardware serviceable events reported to the HMCs LoadLeveler events p. 66 of 135

67 PNSD events GPFS events For more details see TEAL monitoring planning, on page 68. LoadLeveler Planning on the EMS Install the scheduler component on the EMS. The LoadLeveler database configuration will be initialized on this node. See section... for additional information. During initialization postscript files that can be used by xcat during provisioning for LoadLeveler will be created It is not necessary to run LoadLeveler on the EMS. Cluster Database Planning The Cluster Database uses IBM DB2 Workgroup Edition. The database server resides on the EMS with clients on each Service Node. User login: User login is typically provided by login nodes see Login nodes, on page Service Utility Nodes Service utility nodes provide management subsystem connectivity to the other nodes, and they provide the operating system access for diskless nodes. LoadLeveler Planning on Service Nodes In addition, LoadLeveler manager daemons run may on the service nodes. In that case,they need to be able to access the database and also to be on the same network as the compute nodes. Install both the scheduler and resmgr components on service nodes. Python and PyODBC are required for loading into the database LoadLeveler events such as daemons down and jobs rejected and vacated. If Highly Available Service Nodes are being used, instead of service nodes, utility nodes must be used for LoadLeveler manager daemons. See Planning Highly Available Service Nodes and LoadLeveler and TEAL GPFS, on page 67. Planning Highly Available Service Nodes and LoadLeveler and TEAL GPFS Service nodes that are not configured to be highly available can be a logical place to run the LoadLeveler and tlgpfsmon daemons, because they are diskful, connected to the management ethernet LAN, and have access to the cluster database. But, if you are running highly available service nodes, the service nodes must be in a GPFS cluster that is separate from the application data GPFS cluster. If this is your case, you can not run the LoadLeveler and tlgpfsmon daemons on the service nodes, because the service nodes can't be in 2 GPFS clusters at the same time. Instead, you will need to carve out some "utility" nodes, likely from the same octants as some of the service nodes. p. 67 of 135

68 For the utility nodes that are running the LoadLeveler daemons, if LoadLeveler is using plain configuration files, then those utility nodes don't need to have disks and ethernet adapters. They only need to be part of the GPFS application data cluster and be connected to the HFI network. For the utility nodes that are running tlgpfsmon, they must be GPFS monitoring collector nodes and be connected to the cluster database (and therefore have disks and ethernet adapters) HMCs LANs Service: Cluster Management LAN: The Cluster management LAN Customer LAN: The customer LAN is used to allow access to the cluster. Typically this will connect to the EMS for administrator access and to the login nodes, for user access Cluster Monitoring Planning TEAL monitoring planning TEAL is used for monitoring the following: HFI Network events from ISNM Hardware serviceable events reported to the HMCs LoadLeveler events PNSD events GPFS events You may, in turn, choose one of the following methods to monitor TEAL. During installation, you may choose to configure one or more of this methods: Periodically query TEAL via the tllsalert command. This is an interactive method. Configure RMC to monitor the TEAL RMC listener. Configure TEAL to the administrator or operators For GPFS events, a GPFS collection node must be used. Typically, this will be a service node. However, for configurations with highly available service nodes, a utility node must be used instead; see Planning Highly Available Service Nodes and LoadLeveler and TEAL GPFS, on page 67. RMC planning RMC is used by TEAL to monitor the HMC. For more information on RMC, see the Resource Monitoring and Control documentation. RMC can also be used to instrument many difference Performance monitoring planning You will want to determine how often to collect performance counter data from the network. You can control the following parameters. The default values are given. p. 68 of 135

69 Table 2: Performance monitoring planning Parameter Description Default Performance Data Interval How often to collect data 300 seconds Performance Data collection Save Period How long to save data before summarizing 168 hours No. of Previous Performance Summary Data How many periods to save summary data 1 Hardware Serviceable Events monitoring planning The installation procedures will guide you through enabling hardware serviceable events from the HMC to be forwarded to TEAL via RMC. By monitoring TEAL (see TEAL monitoring planning, on page 68), you can then monitoring hardware serviceable events NTP Network Time Protocol (NTP) is very important to the management of the cluster. This is especially true during debug when it may become important to cross-reference timestamps in logs and error reports between different servers Planning Other Utility Nodes Login nodes Login servers are the typical method to provide user access to the cluster. Plan for enough login servers to handle the number of users expected to use the cluster. Also, plan for backup in case of failure. Be sure to locate backup login nodes in different drawers from the primaries. LoadLeveler Planning on Login nodes On login nodes, install the LoadLeveler scheduler component Site specific utility nodes Site specific utility nodes are any nodes that perform functions outside the common ones described in this document: compute, storage, service and login. The individual sites must plan accordingly for these nodes. While planning for these nodes is outside the scope of this document, a common point are to plan not only for capacity, but also for redundancy. In doing so, consider the compute resources and network resources required by a utility node. Do not co-locate primary and backup nodes in the same CEC drawer, in case there is a failure or service action that impacts the entire drawer is p. 69 of 135

2.3.7 Planning HFI network

Choosing a topology

This topic provides information about choosing a topology. Within a PERCS supernode, the Llocal and Lremote link connectivity is fixed. The set of supported topologies is defined by the manner in which supernodes are connected by D-links. Two performance metrics to consider when designing or choosing a topology are the bisection bandwidth and the injection bandwidth. These are defined as follows:

Bisection Bandwidth (BBW): The bisection bandwidth is the percentage ratio of the bandwidth available for communication between two halves of a cluster that have minimum aggregate link bandwidth between them to the maximum possible traffic between the two halves. Any ratio greater than or equal to 100% is considered full bisection bandwidth and is denoted as 100% BBW.

Injection Bandwidth (IBW): The injection bandwidth is the percentage ratio of the total link bandwidth out of a supernode to the sum of the maximum traffic injected by the QCMs. Any ratio greater than or equal to 100% is considered full injection bandwidth and is denoted as 100% IBW.

The following table shows the supported configurations and their bandwidth characteristics. The various topologies are defined using a pair of attributes: the number of D links between supernode pairs (xD) followed by the maximum number of supernodes that can be supported with that many links (ySN), with the constraint that x*y = 512, which is the maximum number of D-link ports in a supernode. D link counts are supported in powers of two. Each entry in the table shows the number of supernodes that can be supported in a given topology, indicated in the first column of the row, and the number of D ports used, as indicated by the top entry in its column. In addition, the BBW and IBW obtainable for each configuration are provided. The blank entries are configurations that are not supported. For example, a 128D (5SN) cluster with 10 D ports/hub cannot use all 10 D ports. Full bandwidth configurations are those with both 100% BBW and 100% IBW; configurations with more than 100% bandwidth are overprovisioned, while the remaining configurations lack enough links to maintain 100% bandwidth.

Table 8. Supported configurations and their bandwidth characteristics (BBW and IBW values are listed in order of increasing number of D ports per hub; unsupported combinations are omitted)

256D (3 SN):   BBW 100%, 100%; IBW 100%, 100%
128D (5 SN):   BBW 65%, 100%, 100%, 100%; IBW 65%, 100%, 100%, 100%
32D (16 SN):   BBW 16%, 33%, 49%, 65%, 100%, 100%, 100%, 100%; IBW 16%, 49%, 81%, 100%, 100%, 100%, 100%, 100%
16D (32 SN):   BBW 16%, 33%, 49%, 65%, 100%, 100%, 100%, 100%; IBW 24%, 57%, 89%, 100%, 100%, 100%, 100%, 100%
8D (64 SN):    BBW 16%, 33%, 49%, 65%, 100%, 100%, 100%, 100%; IBW 28%, 61%, 93%, 100%, 100%, 100%, 100%, 100%
4D (128 SN) (DARPA): BBW 16%, 33%, 49%, 65%, 100%, 100%, 100%, 100%; IBW 30%, 63%, 95%, 100%, 100%, 100%, 100%, 100%
2D (256 SN):   BBW 16%, 33%, 49%, 65%, 100%, 100%, 100%, 100%; IBW 31%, 64%, 96%, 100%, 100%, 100%, 100%, 100%
1D (512 SN) (NCSA): BBW 16%, 33%, 49%, 65%, 100%, 100%, 100%, 100%; IBW 32%, 65%, 97%, 100%, 100%, 100%, 100%, 100%

ISNM requires that the network topology identifier be specified during cluster installation. The installed topology is provided in the xCAT cluster configuration file, or can be manually inserted into the Cluster Database site table as the topology attribute. This topology indicator, which is provided as a string, is also propagated to all FSPs in the cluster using the chnwsvrconfig command. The supported topology strings are shown in the following table.

Table 9. Supported topology strings

Topology Specifier    Maximum Number of Supernodes
256D                  3
128D                  5
64D                   8
32D                   16
16D                   32
8D                    64
4D                    128
2D                    256
1D                    512
8D_SDSN*              8 D links between single-drawer supernodes
2D_SDSN*              2 D links between single-drawer supernodes
* Any topology ending in SDSN is for a single-drawer supernode

Note: Certain topologies listed may exceed the announced GA limits. Such topologies are supported on a special bid basis.
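As an example, if the topology string is not supplied through the xCAT cluster configuration file, it can be placed into the site table by hand with the standard xCAT chtab command. The 8D value below is only an example, and the attribute key should be confirmed against the installation procedures:

chtab key=topology site.value=8D

The string is then propagated to the FSPs with the chnwsvrconfig command as part of the ISNM configuration procedure.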

Planning Cabling

Planning IP configuration for the HFI network

Consideration should be given to configuring the IP addressing for the HFI network. Information for AIX and for Linux follows; a consolidated hypothetical example is provided below, after the Site Planning topic.

AIX: The hfi IP driver is configured using the standard AIX IP configuration tools, including ifconfig, smit, and mktcpip. For early ship machines, four interfaces (hf0, hf1, hf2 and hf3) are supported. Later hardware levels will support interfaces hf0-hf7.

Here is the form of an example ifconfig command:
ifconfig hf0 <IP address> netmask <netmask> up

Here is the form of an example mktcpip command:
mktcpip -h c250f10c11ap29-hf0 -a <IP address> -i hf0 -m <netmask>

Linux: The ifconfig command may be used to configure the hfi IP interface. Supported IP interfaces are hf0 - hf7. These interfaces depend on the Linux hfi core and ip modules being loaded, which happens automatically if the hfi_utils rpm is installed. The hf interfaces may be automatically configured if the files ifcfg-hfX, X = {0,1,2,3,4,5,6,7}, in /etc/sysconfig/network-scripts have been created. The form of an example script /etc/sysconfig/network-scripts/ifcfg-hf0 is given below:

DEVICE=hf0
NM_CONTROLLED=yes
IPADDR=<IP address>
NETMASK=<netmask>
ONBOOT=yes

Planning Security

Other Planning Resources

There is an xcat document, Hints and Tips for Large Scale Clusters, which covers miscellaneous points on cluster configuration, tuning, and other topics of interest.

Site Planning

Site planning is driven mostly by planning for the 9125-F2C systems. This is best covered by the 9125-F2C system planning guide and in conjunction with IBM. Other planning points to consider are space and power for the management nodes and Ethernet devices, as well as other hardware outside of the 9125-F2C systems and frames. These are best covered by the corresponding hardware planning guides.
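The following sketch pulls together the AIX and Linux HFI IP configuration forms described under Planning IP configuration for the HFI network above. All addresses, the netmask, and the host name are hypothetical values used only for illustration; substitute the values from your own HFI IP addressing plan.

# AIX: configure hf0 (10.1.1.29 and 255.255.0.0 are assumed example values)
ifconfig hf0 10.1.1.29 netmask 255.255.0.0 up

# AIX: equivalent mktcpip form (the host name is an assumed example)
mktcpip -h c250f10c11ap29-hf0 -a 10.1.1.29 -i hf0 -m 255.255.0.0

# Linux: example /etc/sysconfig/network-scripts/ifcfg-hf0 (assumed example values)
DEVICE=hf0
NM_CONTROLLED=yes
IPADDR=10.1.1.29
NETMASK=255.255.0.0
ONBOOT=yes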

Planning Power 775 Availability Plus Design

Planning for Power 775 Availability Plus (A+) is mostly done by IBM in the form of planning for spare resources and developing the Refresh Threshold and A+ Reset curves. The customer must plan several things to understand how the A+ policy will apply to and affect the cluster. The following worksheet should be filled out.

Mandatory tasks:
- Review A+ (mark when complete). See the A+ Overview.
- Review responsibilities (mark when complete). See the A+ overview section's discussion on responsibilities.
- Spare policy (Hot/Cold). See the A+ overview section's discussion on Spare policy.
- Maximum job size. This cannot exceed the required Workload Resources.
- Create an A+-defective group for nodes. This is used to help manage and account for failed A+ resources (a sketch of these group definitions follows this worksheet).

LoadLeveler planning (optional):
- Keep spare resources in a separate pool. If yes, enter the name and attributes; otherwise indicate No. See the A+ overview section on job management with A+.
- Keep failed resources in a separate pool. If yes, enter the name and attributes; otherwise indicate No. See the A+ overview section on job management with A+.

Other:
- This covers anything custom that is not covered by the above. Review and document the impacts to the IBM Power 775 Availability Plus Recovery Procedures documented in the High performance clustering using the 9125-F2C Management Guide and Service Guide. Be careful to document any additional or changed steps to the recovery procedure. It is recommended that IBM be contacted to help with this process.
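As a planning aid, a minimal sketch of the xcat node group definitions referred to in the worksheet is shown below. These are the same mkdef commands used later in the software installation procedure (configuration to support the A+ spare policy), and the group names Aplus_resources and Aplus_defective follow that step.

# Group for nodes that are managed as A+ resources
mkdef -t group -o Aplus_resources

# Group used to track defective A+ resources
mkdef -t group -o Aplus_defective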

Reviewing the A+ overview and your responsibilities is critical to the planning process. The maximum size of a job must take into consideration the A+ policy that it must never exceed the required Workload Resources, which is a number smaller than the Total Cluster Resources. For planning optional LoadLeveler characteristics, the job management policy for the cluster must be understood. The A+ overview's section on job management with A+ illustrates some techniques for job management in an A+ environment.

Planning Installation

To plan for installation activities, review Cluster Installation, on page 75. You should accomplish the following in your review:
- Understand the terminology and teams involved in cluster installation.
- Understand the overall flow of installation. If you are a member of a particular installation team, you should understand the team's responsibilities and its dependencies on other teams, and become familiar with the detailed procedures for the team. Also, identify the particular tasks within the procedures that will be your responsibility.
- If you are coordinating the installation, you should understand the overall flow in detail and understand the key dependencies between teams.

Coordinating Installation

This topic provides information about coordinating installation. Be sure to read the Installation section's information on the various teams. Choose an installation coordinator. Review the order of installation and determine any points in the installation process where one or more team members are dependent on other team members to complete a task. Make a checklist of these merge points and use it to coordinate installation tasks.

Phased Installation

A phased installation consists of deploying hardware and software in stages, where some resources are made available for useful work as others are being installed. If a phased installation is required, contact IBM to work with your planning team.

2.4 Cluster Installation

The following sub-sections describe how to install a cluster with 9125-F2C systems.

How to use Cluster Installation

The overall installation flow is intended to be read before beginning to execute installation procedures. Because the installation of a cluster is a complex activity involving multiple teams and various interdependencies, it is important to note where a particular team's tasks fit within the overall flow of installation. The teams should understand their dependencies before proceeding. Also, an overall installation coordinator should be appointed to direct the teams and to plan activities to limit the amount of downtime caused by teams waiting on dependencies. Furthermore, the installation coordinator should be sure to take into account any unique circumstances at a particular site that could add steps and procedures not documented here, or may cause delays in dependencies being met.

Once teams understand what they are trying to accomplish and where they fit into the overall flow, they should review the detailed procedures for which they are responsible.

Cluster Installation is organized in the following manner:
- Overall Cluster Installation, on page 76, describes the organization and flow of installation as well as certain overall topics, such as the following:
  o Installation Terminology, on page 76, introduces key terms used in installation.
  o Installation Responsibilities, on page 77, describes installation responsibilities and teams.
  o Overview of Installation Flow, on page 82, describes the overall installation flow amongst all of the teams. It highlights dependencies as well as key tasks.
- Detailed installation procedures, on page 89, is organized by teams and provides the guidance for the installation steps. These refer heavily to other documentation, but, where required, give special instruction on how to use the other documentation.
- Other Installation Procedures, on page 102, contains procedures that are referenced in previous sections.
- Altering cluster configuration describes additional procedures specific to altering the configuration of a previously installed cluster.
- Installation checklists, on page 123, provides task checklists to be used during installation, and quick command and file references to be used by the more experienced installers.
- Installation Responsibilities by Component, on page 126, provides a quick lookup of which team is responsible for installing and bringing up any particular component.

Review the Table of Contents to help understand Cluster Installation's organization.

2.4.2 Overall Cluster Installation

This section describes the overall cluster installation, including responsibilities, terminology, and an overview of the installation flow.

Installation Terminology

Table 3: Network and Cluster Terminology

Service Network: Ethernet network used to manage hardware via the out-of-band path (HMC, FSP, BPA, switch). Sometimes the service network is referred to as the control network.

Cluster Management Network: Ethernet network used for node deployment (network install) and system management of the service nodes.

HFI (Host Fabric Interface) Network: The high speed interconnect used for the parallel application tasks to communicate, for GPFS data access, and to deploy/manage the compute nodes.

ISR (Integrated Switch Router) network: Another name for the HFI Network.

Public Network: Network that provides access outside of the cluster and to the site in general.

Build Cluster: A term used when installing very large clusters in phases. It is used in conjunction with the term Base Cluster. A Build Cluster is the portion of the cluster that is currently being built and will be joined to the base cluster after a certain level of stability has been achieved.

Base Cluster: A term used when installing very large clusters in phases. It is used in conjunction with the term Build Cluster. A Base Cluster is the portion of the cluster to which build clusters are added. It is intended to be the cluster that will go into production. Its size increases as build clusters are added to it during each installation phase.

Table 4: Boot Terminology

Rack Standby: EPO power applied to the BPA. This is the first state entered when the EPO is turned on. This state is where xcat will first discover the BPAs.

BPA Standby: Minimal power applied to the FSPs. This is the state after Rack Standby. xcat is used to move from Rack Standby to BPA Standby. xcat must first discover the BPAs before this can be done.

phyp Standby: Power applied to the FSPs and the remainder of the CEC drawer. The operating system can now be loaded.

LPAR Standby: Another name used for phyp Standby.

OLCT or Low-power mode: A special boot mode that applies enough power to the optical modules to provide minimal verification of link operation and cabling.

Operating System Ready: Operating system successfully booted and available.
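As an illustration of how these boot states are observed during installation, the commands below (taken from the verification steps later in this guide) query the state of the CEC drawers from the EMS. The noderange cec and the state strings in the comments follow the installation steps in this document.

# Query the power/boot state of all CEC drawers; expected values include
# power off, IPL-in-progress, power-on-transition, operating, and standby
rpower cec stat

# Watch the boot progress codes while drawers move toward phyp (LPAR) standby
rvitals cec lcds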

Installation Responsibilities

Understanding responsibilities for installation tasks is a good starting point for understanding the overall installation and bring-up process. This section describes several installation teams and their responsibilities. Then, it indicates responsibilities with respect to the major installation procedures documented in Detailed installation procedures, on page 89. If, after reviewing the teams below, you wish to determine which team is responsible for installation and bring-up of any particular component, see Installation Responsibilities by Component, on page 126.

The following generic teams are defined. During the planning phase for the cluster, the customer and IBM must discuss the specifics of who will be the members of each particular team and their individual task assignments.

Customer Installation Team (CIT): The customer is responsible for certain installation tasks prior to any IBM or other vendor installation tasks. Typically, the customer or a designated vendor will also be responsible for any post-installation and bring-up tasks involving non-IBM hardware. Such tasks are not documented here.

Hardware Installation Team (HIT): The hardware installation team typically consists of IBM System Service Representatives (SSRs) and movers employed by IBM.

HFI Cabling Installation Team (HCIT): The HFI cabling installation team can consist of IBM SSRs, but quite often IBM cabling specialists are provided for the installation.

Software Installation Team (SIT): The software installation team can be a combination of customer resources and IBM resources, but it is typical to employ IBM software installation specialists.

Filesystem Installation Team (FSIT): While it is possible that the filesystem can be installed by the software installation team, the typical complexity of filesystem installation will often dictate the use of a special filesystem installation team.

Performance Verification Team (PVT): A performance verification team may be assigned to evaluate the performance of the cluster once it has reached a desired level of stability. This is determined on a case by case basis. The specific verification procedures are also informed by the specific case.

Other installation teams: Other installation teams may be required to install hardware or software that is not covered in this document. For example, installation of tape archives is not covered in this document.

Other test and verification teams: For certain special bid installations, it may be necessary to perform extended testing and verification on the cluster. Typically, these tasks are negotiated ahead of time and should be documented as part of the installation and bring-up plan.

Cluster Installation Coordinator: The cluster installation coordinator will direct the teams to ensure as smooth an installation and bring-up as possible. The coordinator should manage the overall team activities and dependencies, making sure to account for any special procedures not documented here.

The detailed installation procedures that follow document tasks as they relate to the recommended teams. This allows each team to act as independently as possible. However, dependencies should still be reviewed, especially using Overview of Installation Flow, on page 82, as well as a detailed review of each of the procedures.

Pre-installation tasks should be performed by the customer's installation team.

Software installation should be performed by the software installation team. This includes management software, operating system images, firmware and HPC software. If a separate team is required for filesystem configuration and bringup, that team will be known as the filesystem installation team.

9125-F2C system hardware installation should be performed by the hardware installation team, comprised of the IBM System Service Representative (SSR) and the contracted mover to place frames.

HFI network cabling installation should be performed by a hardware cabling installation team. This may be executed by the same hardware team that is installing the 9125-F2C system hardware. However, with larger clusters, separate teams may be more efficient. In this way, as one team is installing the next set of frames, the other may be cabling. If separate teams are used, the team performing the cabling is known as the HFI cabling installation team.

Note: Application software is not covered by this documentation and is left to the local site to plan and install.

All teams should review their respective installation sections before attempting to execute them. See the Planning section for Installation Responsibilities for information helpful to developing a cohesive group of installation teams for maximum efficiency during the installation.

Installation Documentation

Besides this document, the various teams will require many different documents to perform the installation. The following lists the documents required by each team. Be sure that you have access to the documentation before you begin. As much as possible, the teams should also review the documentation beforehand and become familiar with the flow of documentation.

Table 5: Software Installation Team Documentation (Document; Use; Location)

- Cluster Guide. Use: Overall flow; ISNM specifics. Location: This document.
- System Delivery Install Plan and Site Preparation Guide. Use: A custom document generated for each particular site. Location: Should be provided to you by IBM.
- xcat* documentation:
  - Configuration file setup. Use: How to set up the xcat configuration file ahead of time. Location: atsetup.8.html
  - User Management. Use: Used for planning and setting up security associated with xcat. Location: wiki/xcat/index.php?title=user_management
  - xcat port usage. Use: Used for planning and configuring security associated with firewalls and xcat. Location: wiki/xcat/index.php?title=xcat_port_usage
  - Installing a Linux Cluster. Use: Used for installing a Linux cluster. Location: iki/xcat/index.php?title=xcat_plinux_clusters
  - Installing an AIX Cluster. Use: Used for installing an AIX cluster. Location: iki/xcat/index.php?title=xcat_aix_cluster_overview_and_mgmt_node
  - Setting up a Linux Management Node. Use: Installing and planning the EMS in a Linux cluster. Location: wiki/xcat/index.php?title=setting_up_a_linux_xcat_mgmt_node
  - Setting up an AIX Management Node. Use: Installing and planning the EMS in an AIX cluster. Location: iki/xcat/index.php?title=xcat_aix_cluster_overview_and_mgmt_node#set_up_an_aix_system_to_use_as_an_xcat_management_node
  - Configuring primary and backup EMS for high availability using shared disks. Use: Used when installing dual EMS. Location: wiki/xcat/index.php?title=shared_disks_ha_mgmt_node
  - Alternative method for configuring primary and backup EMS for high availability. Use: Used for another configuration when installing dual EMS. Location: wiki/xcat/index.php?title=setting_up_an_ha_mgmt_node

Table 5: Software Installation Team Documentation (continued)

- xcat Port Usage. Use: Planning firewall connectivity relative to the management networks. Location: awiki/xcat/index.php?title=xcat_port_usage
- User Management. Use: Planning userid management on the EMS. Location: wiki/xcat/index.php?title=user_management
- Setting up DB2 as the xcat Database. Use: Used for installation when configuring the xcat Cluster Database implemented in DB2. Location: wiki/xcat/index.php?title=setting_up_db2_as_the_xcat_DB
- xcatsetup. Use: Using xcatsetup to help with configuring a cluster. Depending on your naming conventions, this may not be usable. Review the documentation to decide. Location: atsetup.8.html
- Setting up a Linux Hierarchical Cluster. Use: Used for installing a Linux cluster. Location: wiki/xcat/index.php?title=setting_up_a_linux_hierarchical_cluster
- Setting up an AIX Hierarchical Cluster. Use: Used for installing an AIX cluster. Location: iki/xcat/index.php?title=setting_up_an_aix_hierarchical_cluster
- p775 Hardware Management. Use: Used during installation for discovering and configuring hardware. Location: wiki/xcat/index.php?title=xcat_power_775_hardware_Management
- xcat commands. Use: man pages for xcat commands. Location: wiki/xcat/index.php?title=xcat_commands

Other documentation:
- TEAL. Location: On the EMS: /opt/teal/doc/teal_guide.pdf; on sourceforge.net: wiki/pyteal/index.php?title=main_page
- LoadLeveler for AIX Installation Guide. Use: Used for installing LoadLeveler. Location: enter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.loadl.doc/llbooks.html

Table 5: Software Installation Team Documentation (continued)

- Setting up LoadLeveler in a Statelite or Stateless Cluster. Location: iki/xcat/index.php?title=setting_up_loadleveler_in_a_Statelite_or_Stateless_Cluster
- Setting up LoadLeveler in a Stateful Cluster. Location: ki/xcat/index.php?title=setting_up_loadleveler_in_a_Stateful_Cluster

* xcat information is found in title=xcat_documentation. It is in a wiki format. The major documentation that is required is listed above. This is not an exhaustive list for every contingency. The xcat documentation will provide links to much more detail when appropriate. You may also download an application and its pre-requisites that can convert the wiki pages to PDFs; see title=editing_and_downloading_xcat_documentation#converting_wiki_pages_to_html_and_pdfs

Table 6: Filesystem Installation Team Documentation

- Cluster Guide. Use: Overall flow. Location: This document.
- GPFS documentation.

Table 7: Hardware Installation Team Documentation

- Cluster Guide. Use: Overall flow. Location: This document.
- System Delivery Install Plan and Site Preparation Guide. Use: A custom document generated for each particular site. Location: Should be provided to you by IBM.
- WCII. Use: Worldwide Custom Installation Instructions. The document title is New Mfg-Order 1AGGNC7 - 9125-F2C, SN 02009A7C6 - New System Installation Guide. Location: IBM WCII site.
- FDT document. Use: Coolant Fill and Drain Tool documentation; System Fill. Location: Field Service document.

Table 8: HFI Cabling Installation Team Documentation

- Cluster Guide. Use: Overall flow. Location: This document.
- System Delivery Install Plan and Site Preparation Guide. Use: A custom document generated for each particular site. Location: Should be provided to you by IBM.

Overview of Installation Flow

The overall installation flow can be broken down into three major tasks:
- Integrate hardware and management software
- Bring up nodes and integrate HPC software
- Verify and refine

The following two figures illustrate the overall installation flow. The first figure describes the integration of the hardware and the management software. The second figure describes bringing up nodes, integrating HPC software, and verifying and refining the function of the cluster. Brief descriptions of the tasks are provided after the figures.

By and large, the figures are broken into major boxes that encompass team activities. In some cases, another team may perform a task within a particular team's box. Rather than try to illustrate this with another box that may clutter the figure, an oval with a letter identifying the other team is included to the right of the task box, such as "M". The team identifiers are:

M = contracted Movers
P = Performance Verification Team
O = Others
C = HFI Cabling Installation Team
S = Software Installation Team

Above each box is text indicating the specific order of installation procedure that details how to execute the tasks. The procedures are found in Detailed installation procedures, on page 89.

The flow in the figures illustrates dependencies between teams as well as the task flow for any particular team. When there is a dependency, a merge symbol is used before the task that is dependent on other team activities. Arrows from the pre-requisite tasks join into the merge symbol.

In order to link tasks between the two figures and also to reduce the complexity of the figures, pairs of off-page connectors are used to indicate the flow from one task to another. Pairs of off-page connectors are identified by having the same letter within each connector, such as "A".


Figure 2: Integrate Hardware and Management Subsystem

Figure 3: Integrate Nodes and HPC Software, and Verify and Refine Cluster

The following briefly describes the tasks illustrated in the above figures. There is a table of tasks for each team.

Table 9: Software Installation Team (SIT) Tasks

Install Management Software on EMS
- Sub-tasks: Install the foundational packages for the EMS (xcat, HWS, DB2); install CNM, the LL Scheduler, and the TEAL component; initial EMS configuration and Cluster DB setup.
- Interteam dependencies: EMS up and available, by HIT.

Discover Frames
- Sub-tasks: Discover frame HW (BPAs) from the EMS and assign to HMCs; update power firmware; exit Rack Standby.
- Interteam dependencies: 9125 Frame Installation complete by HIT, with EPO turned on.

Discover System Hardware
- Sub-tasks: Verify xcat access to FSPs and BPAs to complete the discovery process and add to the Cluster Database; assign HW to HMC(s); update system firmware; bring up CECs to phyp Standby.
- Interteam dependencies: 9125 Frame Water fill complete by HIT.

Verify Systems / Update Firmware
- Sub-tasks: Check boot progress codes from the EMS; update system and frame/power firmware as required.
- Interteam dependencies: HMC Verify Systems by HIT.

Start Network Management
- Sub-tasks: Start CNM; verify that CNM has access to the FSPs.

Verify HFI Cabling
- Sub-tasks: Check link status and check for miswires; work with the HCIT.
- Interteam dependencies: Network Management started and HFI network cabling started.

Verify Hardware Availability
- Sub-tasks: Check for deconfigured resources in the CECs; check Service Focal Point.
- Interteam dependencies: HFI Cabling done by HCIT, and re-verify of systems done by HIT (if firmware updated).

Final Systems Verification

HW Installation Complete
- The currently placed and powered-on hardware has been installed, integrated and verified.

Service Node Bringup
- Sub-tasks: Configure the management hierarchy; define the Service Node LPARs; create the Service Node image for diskful service nodes; push Service Node images out.
- Interteam dependencies: Final Systems Verification complete.

Node Bringup
- Sub-tasks: Define node LPARs for diskless compute, IO and utility nodes; create node images; boot LPARs.

Update IO firmware
- Sub-tasks: Update Disk Enclosure SAS expansion card firmware, disk firmware, SAS adapter firmware, and other adapter firmware.

Verify Network Stability
- Sub-tasks: netvsp prior to GPFS configuration, and use of netperf all-to-all.
- Interteam dependencies: EMS has access to BPAs & FSPs.

Verify Monitoring
- Sub-tasks: HMCs have access to BPAs & FSPs; TEAL has access to LPARs; TEAL receives HW serviceable events from HMCs.

Bringup LoadLeveler
- Sub-tasks: Bring up LoadLeveler.
- Interteam dependencies: Filesystem Installation Team has GPFS ready.

Verify Protocols
- Sub-tasks: PE interactive US and IP tests; start LL and run the protocol test; run a PESSL job.
- Interteam dependencies: Performance verification and other testing complete (as applicable).

Frame/Building Block Bringup Complete
- An activity for large clusters which involves installing a portion of the cluster and bringing it into a test or production mode while the remainder of the cluster is being installed or waiting to be installed. If more hardware is available, continue: merge and return to Discover Frames/Discover System Hardware.
- Interteam dependencies: Base and build clusters are stable.

Cluster Bringup Complete
- All hardware has been integrated into the cluster and verified. All HPC software and the filesystem have been brought up. All standard verification steps have been completed. Any optional verification and testing has been completed.

Table 10: Hardware Installation Team (HIT) Tasks

Install Management Hardware
- Sub-tasks: Place server frame(s) for the EMS & HMCs; level frame(s); place EMS and HMC hardware in frame(s); power on and verify that it boots.
- Interteam dependencies: Customer pre-installation tasks complete, and management hardware arrival on-site.

9125 Frame Placement
- This is typically performed by contracted movers and not the hardware installation team.

9125 Frame Installation
- Sub-tasks: Level frames; power on frames with EPO.
- Interteam dependencies: 9125 Frame Placement by movers.

9125 Frame Water
- Sub-tasks: Fill frames with water.
- Interteam dependencies: Discover Frames by SIT.

Check Disk Enclosure LEDs
- Sub-tasks: Repair as necessary.
- Interteam dependencies: Discover System HW by SIT.

HMC Verify Systems
- Sub-tasks: Check boot progress codes from the HMC; check Service Focal Point from the HMC.
- Interteam dependencies: HMC can access FSPs & BPAs.

Top off water
- After systems have been running for a period of time, top off the water. Wait as long as you can before topping off the water.

Final Verify Systems
- Sub-tasks: Check Service Focal Point.
- Interteam dependencies: HFI Cabling done by HCIT and Verify Hardware Availability by SIT.

System Installation Complete
- This is a milestone. No steps to perform.

HW Installation Complete
- The currently placed and powered-on hardware has been installed, integrated and verified.

Table 11: HFI Cabling Installation Team (HCIT) Tasks

Route Cables
- Sub-tasks: Begin placing cables under the floor.
- Interteam dependencies: Customer pre-installation tasks complete.

Connect Cables
- Sub-tasks: Connect cables to the frames.
- Interteam dependencies: Frame Installation complete by HIT.

Verify Cabling
- Sub-tasks: Check link status (work with the SIT on the EMS); check for cable miswires (work with the SIT on the EMS).
- Interteam dependencies: Network Management started by SIT.

HFI Cabling Done
- If more cables must be installed, return to Route Cables or Connect Cables, as appropriate. Otherwise, cabling is done.

Table 12: Filesystem Installation Team (FIT) Tasks

Verify Disk Enclosures
- Sub-tasks: Use xdd.
- Interteam dependencies: IO Node Bringup completed by the SIT.

Verify HFI Network
- Sub-tasks: Use netperf.

Configure GPFS

Bringup GPFS
- Sub-tasks: Run gpfsperf and IOR.

Build filesystem

Verify Filesystem

Detailed installation procedures

Pre-installation tasks

The following tasks are typically performed before arrival of IBM equipment and the commencement of IBM software installation. The customer installation team is responsible for these tasks.
- Site preparation for power and cooling
- Installation of site networking
- Installation of service and cluster management Ethernet hardware

- Cabling of service and cluster management Ethernet cables
- Placement of the hardware for the Executive Management Server (EMS). It is possible that pre-installation tasks have been planned for the EMS. If this is the case, verify that those have been executed.
- Do not connect cables from management consoles, servers, or frames to the service and cluster management networks until you are instructed to do so within the installation procedure for each management console.

Note: Proper ordering of management console installation steps and cabling to the VLANs is extremely important for a successful install. Improper ordering can result in very long recovery procedures.

Software installation

This topic provides information about the order of installation for the cluster software. These tasks are to be performed by the software installation team. The filesystem installation may be done by a separate filesystem installation team. The following is to be accomplished:
- Installation and bringup of the Executive Management Server, including xcat, Hardware Server, DB2, CNM and TEAL.
- Hardware discovery and 9125-F2C system and power subsystem firmware updates
- Creation, installation and bringup of node operating system images
- Configuration and bringup of the GPFS filesystem
- Configuration and bringup of LoadLeveler
- Verification of installation, bringup and integration steps at appropriate times

Before you begin:
- Review all worksheets and information generated during the planning phase.
- Understanding the HFI network topology is critical to proper configuration of the cluster. Topology information is input into the Cluster Database in the form of site and node attributes presented in the xcat cluster configuration file. These values are then used by CNM, which distributes them to the FSPs for local network management operations.
- Complete the xcat cluster configuration file. Ensure that it contains the necessary switch topology information and the agreed-upon naming conventions for frames, BPAs, FSPs and LPARs. For more information, review the xcat setup manpage. If you cannot use the naming conventions described in the xcat setup manpage, you will be instructed in the installation documents referenced below on how to configure the cluster using alternative methods.
- Review the installation procedure thoroughly. Verify that you can access all links that are referenced. Many of the detailed installation steps are in the xcat wiki. The xcat documentation is organized by function. At times it will be necessary to navigate multiple pages in the xcat wiki. Not all documents will be completed beginning to end; entry and exit points will be noted. References to documents with generic procedures that can apply to many solutions will include information specifically required by a cluster using the 9125-F2C. For any documents referenced by name only, refer to the References section of this document.

- Ensure that you have reviewed your installation responsibilities and the Installation Terminology found above, in Overall Cluster Installation, on page 76.
- Review the software installation checklist, and prepare the checklists to be used.
- Review diskless node logging in Diskless Node Logging Configuration, on page 111.
- Review placement of LoadLeveler and TEAL GPFS daemons in Placement of LoadLeveler and TEAL GPFS Daemons, on page 114.
- Review BSR configuration in Barrier Sync Register (BSR) Configuration, on page 116.
- Verify that the pre-installation tasks have been performed by the customer installation team. See Pre-installation tasks, on page 89.
- Review the Security configuration notes, below.

References:
1. For more information on xcat, see xcat Documentation. From the mentioned documentation you can access documents that guide you in setting up many types of clusters. Reference specific documents to aid in this setup. Some common documents of use are under the Common xcat Features topic. For example, you can access all the xcat commands and their manpages grouped by function at xcat Commands.

Security configuration notes:
- xcat automatically configures the client/service certificates and ssh keys for root (if they do not already exist) during installation.
- xcat automatically configures root with the authority to run xcat commands and ssh from the EMS to the service and compute nodes. For information on how to add more ids, see User Management.
- Ensure isolation of the different networks to which the EMS is connected, such that there is no routing of packets between the public network and the service or management networks.
- Configure any firewalls that are required according to site policy. It is highly recommended that there be no firewall between the EMS and the service nodes, or between the service nodes and the nodes they are servicing. If, according to site policy, a firewall is necessary in either of these cases, see the instructions in xcat Port Usage. In the xcat Port Usage document, review the required ports that need to be open for xcat system management. For example, you may consider having the RMC internet port 657 blocked from the public network, such that RMC commands can be run only from the EMS or from the service/management networks. Alternatively, you may simply use the default user ID and RSCT security.

Note: Some documentation referenced below is generic and covers more clusters than those with the 9125-F2C system. Always take special note of any portion of a document that specifically references the 9125-F2C, Power 775 or p775.

Step 1: Ensure that the service and management network has been connected and configured during the pre-installation phase. See Pre-installation tasks, on page 89.

Connect and Configure Ethernet Switches (Service & Management Network)

Step 2: Install the Management Server (Executive Management Server (EMS), or xcat Management Node)
1. Connect the EMS to the service, cluster, and public networks, or inform the hardware installation team that is responsible for connecting cables to do so.
2. From the product media, install the operating system for the EMS.
3. Install optional software (provided on media).
4. Include software updates and efixes (Fix Central or distro).
5. System tuning, for scaling.
6. Configure the network connections.
7. Configure the xcat Executive Management Node (EMS).
For AIX: title=xcat_aix_cluster_overview_and_mgmt_node#set_up_an_aix_system_to_use_as_an_xCAT_Management_Node, up to "Download and install the xcat software".
For Linux: title=setting_up_a_linux_xcat_mgmt_node
Note: When asked to indicate the type of install for xcat on the EMS, use Workstation.

Step 3: Configuring multiple EMS
Review the appropriate procedure below, and especially the referenced xcat web page, with respect to the next step in this document (Step 4: Install and configure xcat). You should note any steps that are dissimilar. This is especially important for the secondary EMS.

If you have multiple EMS and are using twin-tailed shared disks, use the following procedure:
- Install the 19-inch DASD Expansion Drawer according to the instructions included with it.
- Cable the DASD according to the customized planning done for the cluster.
- Set up the shared disks for HA Management Nodes.
- After configuring the disks, pay particular attention to the secondary EMS setup, database replication and file synchronization.

If you have multiple EMS and are not using twin-tailed shared disks, use the alternative procedure referenced in the Software Installation Team documentation.

Step 4: Install and configure xcat, and install other HPC software, on the EMS.
The order of steps presented in referenced documents is important. Each step should be completed before the next one is attempted. Use the following documentation to perform the initial setup of the EMS. You must follow the documentation that applies to the operating system installed in the managed servers in the cluster.

You will execute the following tasks:
1) Install xcat and its prerequisite software
2) Preliminary configuration of the xcat cluster database
3) Setup of Domain Name Resolution
4) Creation of file systems for install resources (/install)
5) Configure DHCP, and configure conserver as needed

For AIX: title=xcat_aix_cluster_overview_and_mgmt_node (Follow the referenced document instructions to the sections entitled "Next Steps".)
For Linux: (Follow the referenced document instructions until you reach "Setup xcat MN for a Hierarchical Cluster". Then, return to this document.)

At this point, the EMS has been configured with the initial xcat configuration.

Cluster database configuration

Before proceeding, ensure that you have performed the following steps to begin configuring the cluster database. If these have already been executed, continue on to configure the hardware service network on the EMS.

Convert xcat from the SQLite database to a DB2 database. CNM and other software require DB2. After the DB2 installation, a script is provided that will do the conversion for you. The manual steps, which are included, are a good reference in case troubleshooting needs to take place.
For AIX and Linux: see the Setting up DB2 as the xcat Database documentation.

Prime the xcat cluster database using xcatsetup and create the /etc/hosts master file, as illustrated in the sketch below. If you are not using xcatsetup, because your cluster naming conventions do not apply, you will be instructed in other referenced documentation in subsequent steps on how to use alternate methods for setting up the xcat cluster database.
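A minimal sketch of priming the cluster database with xcatsetup is shown below, assuming the xcat cluster configuration file was saved under /install as recommended earlier. The file name cluster.conf is an assumed example.

# Prime the xcat cluster database from the cluster configuration file
# (/install/cluster.conf is an assumed example path)
xcatsetup /install/cluster.conf

# Review the resulting site definitions
lsdef -t site -l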

Configure the hardware service network on the EMS:

Before proceeding, ensure that the following has been completed. Typically, these steps are performed by various other team members. Consult with the other teams and the installation coordinator to determine who should execute these steps and when they have been executed. The cable and switch setup may have been done by another team. This is a summary of what should be set up:
- Cable the EMS and HMC to the A-side cluster service network.
- Cable the frame ports (1Gb T19 and T36 ports on the A-side BPCH) to the A-side cluster service network. If you are only configuring one Ethernet cable per BPCH, use the T19 port. For a diagram, see the Worldwide Custom Installation Instructions for the 9125-F2C.
- Cable the EMS and HMC to the B-side cluster service network.
- Cable the frame ports (1Gb T19 and T36 ports on the B-side BPCH) to the B-side cluster service network. If you are only configuring one Ethernet cable per BPCH, use the T19 port. For a diagram, see the Worldwide Custom Installation Instructions for the 9125-F2C.

Continue configuration of HPC applications on the EMS and configure hardware:

Use the documentation that applies to the operating system installed in the managed servers in the cluster. You will execute the following tasks:
1. Switch to the DB2 database under the "Switch to a relational database" step.
2. Setup for a Power 775 Cluster (the list is an overview of tasks to be done, and does not necessarily imply an order of installation):
- If not already done, create an xcat cluster configuration file. Record where this has been saved.
- Install HPC Hardware Server.
- Discover and define hardware components. Note: You can only discover hardware components that have been installed to the point where they can be powered on. You will be using xcat Power 775 Hardware Management (title=xcat_system_p7_775_hardware_management#power_on_the_fsps.2c_discover_them.2C_Modify_Network_Information.2C_and_Connect).
- Install LoadLeveler. Review placement of LoadLeveler daemons in Placement of LoadLeveler and TEAL GPFS Daemons, on page 114.
- Install TEAL. Review placement of the TEAL GPFS daemon in Placement of LoadLeveler and TEAL GPFS Daemons, on page 114.
- Install ISNM.
- Configure and start ISNM.
- Discover and define your cluster nodes.
- Define your Service (Utility) nodes.
- Create the images for your cluster and service nodes. Review diskless node logging in Diskless Node Logging Configuration, on page 111.
- Install your service nodes.
- Add additional network configurations (HFI).
- Install your cluster compute nodes. Recall the need to configure diskless node logging as described in Diskless Node Logging Configuration, on page 111, and the need to configure BSR capability as described in Barrier Sync Register (BSR) Configuration, on page 116.
- Setup of GPFS I/O servers.
- Start LoadLeveler on the service nodes; then, start LoadLeveler on the compute nodes.
- Setup login nodes.

- Setup backup Service Nodes.

Note: If the following procedures do not indicate how to configure ISNM, but only reference when to do so, temporarily return to this document and perform the procedure in Configure ISNM, on page 102. After performing that procedure, return to the xcat procedure.

For AIX: title=setting_up_an_aix_hierarchical_cluster
For Linux: title=setting_up_a_linux_hierarchical_cluster

Note: You have already installed xcat and do not need to repeat that in "Setup the MN Hierarchical Database". However, in "Setup the MN Hierarchical Database", you do need to convert the database to DB2.

After performing the above procedure for the appropriate operating system, ensure that you performed the steps to configure xcat to help with A+ resource management.

Step 5: Verify hardware boot progress codes

Before you begin: Contact the hardware installation team and ensure that the CEC drawers have power applied to them.

Perform the following procedure (a consolidated sketch follows this step):
1. Power on the CEC drawers: rpower cec on
2. Run: rvitals cec lcds and watch it changing, because the boot can take up to an hour. Run: rpower cec stat; states are IPL-in-progress or power-on-transition while booting, and operating or standby when done.
3. Verify all nodes are at phyp (or LPAR) standby with no errors. Any nodes that have power applied should be at this state. If waiting does not resolve the state issue, consult with the hardware installation team regarding a plan to resolve the problem.
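A minimal consolidated sketch of Step 5, run from the EMS, might look like the following. The cec noderange follows the commands above, and the polling loop is purely illustrative.

# Power on all CEC drawers
rpower cec on

# Watch boot progress codes; a full boot can take up to an hour
rvitals cec lcds

# Check the power state; expect IPL-in-progress or power-on-transition while
# booting, and operating or standby when done
rpower cec stat

# Optionally poll until every drawer reports operating or standby (illustrative)
while rpower cec stat | grep -vE 'operating|standby'; do sleep 60; done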

Step 6: Verify the HFI network

Before you begin:
- Contact the HFI network cabling installation team and ensure that HFI network cables have been connected to the CEC drawers.
- Contact the hardware installation team and ensure that the CEC drawers to which the HFI network cables are connected have power applied to them.
- If not already started, start CNM: /usr/bin/chnwm -a

Perform the procedure in Using ISNM to verify the HFI network, on page 105.

Step 7: Verify Hardware Availability
Run: rinv all deconfig
Nothing should come back as deconfigured; the only thing listed should be the planar for each CEC drawer (U78A9.001.xxxxxxx-P1). Any deconfigured resources are considered to be a problem and should be referred to the hardware installation team. A decision on replacement will be made according to the Availability Plus policy for the site.

Step 8: Final verification of the 9125-F2C systems
Verify that the hardware installation team sees no remaining actionable problems in Service Focal Point before proceeding. At this point, the hardware installation and management subsystem integration phase is complete.

Step 9: Bring up the service nodes
If you have not already done so, configure and bring up the service nodes according to the sections listed below in the appropriate link for the operating system being used:

AIX
Setting up an AIX Hierarchical Cluster (title=setting_up_an_aix_hierarchical_cluster)

Linux
Setting up a Linux Hierarchical Cluster (title=setting_up_a_linux_hierarchical_cluster)
- Define the service nodes: title=setting_up_a_linux_hierarchical_cluster#define_the_service_nodes_in_the_database
- Set up the service nodes: title=setting_up_a_linux_hierarchical_cluster#set_up_the_service_nodes_for_diskfull_installation
- Install the service nodes: title=setting_up_a_linux_hierarchical_cluster#install_or_stateless_boot_the_service_nodes_on_linux
- Test the service nodes:

title=setting_up_a_linux_hierarchical_cluster#test_service_node_installation
- Configure the service nodes: title=setting_up_a_linux_hierarchical_cluster#configure_service_node_for_p7_IH.28Optional.29
- Configure the backup service nodes: title=setting_up_a_linux_hierarchical_cluster#setup_backup_service_nodes

Step 10: Bring up the other node types
Choose the correct operating system and perform the procedures listed for it.

AIX
For diskless installs: recall the need to configure diskless node logging as described in Diskless Node Logging Configuration, on page 111, and the need to configure BSR capability as described in Barrier Sync Register (BSR) Configuration, on page 116.
For diskful installs: follow the referenced AIX cluster documentation.

Linux
Setting up a Linux Hierarchical Cluster (title=setting_up_a_linux_hierarchical_cluster)
- Define and install the compute nodes: title=setting_up_a_linux_hierarchical_cluster#define_and_install_your_compute_Nodes
  Recall the need to configure diskless node logging as described in Diskless Node Logging Configuration, on page 111.
  Recall the need to configure BSR capability as described in Barrier Sync Register (BSR) Configuration, on page 116.
- Define and install the GPFS IO nodes: title=setting_up_a_linux_hierarchical_cluster#setup_of_gpfs_i.2fo_server_nodes
- Define and install login nodes: title=setting_up_a_linux_hierarchical_cluster#using_login_nodes

Other nodes
Other nodes may be similar to either GPFS IO or compute nodes, depending on whether or not they have adapters associated with them. Choose the procedure for the node type that most closely resembles the other node that is being installed. Close attention should have been paid to planning for this during the planning phase.

Step 11: Update IO firmware
Perform the procedures to update the following firmware associated with the IO nodes:
- SAS expansion cards in the Disk Enclosure
- Disks in the Disk Enclosure
- SAS adapters
- Other adapters

Step 12: Verify Network Stability
To verify network stability, do the following (an illustrative all-to-all ping sketch is given after Step 16, below):
- Ensure that all nodes are booted to the operating system
- Run netvsp
- Run an all-to-all ping test

Step 13: Verify Monitoring Capabilities
You will verify the following:
- The EMS has connectivity to all BPAs and FSPs
- The HMCs have access to all BPAs and FSPs
- TEAL has access to LPARs
- TEAL receives hardware serviceable events

Step 14: Verify Protocols
You will verify the following:
- PE interactive US and IP tests
- Enable LoadLeveler and run the protocol test
- Run a PESSL job

Step 15: Bring up LoadLeveler
Configure the LoadLeveler Central Manager using the configuration file that should have been planned. Also configure LoadLeveler with respect to A+. Refer to the worksheet in Planning Power 775 Availability Plus Design, on page 73, and configure the following:
- The maximum number of nodes in a job
- A pool to manage spare resources (optional); see Step 18 for where node groups are configured
- A pool to manage failed/degraded resources (optional); see Step 18 for where node groups are configured

Step 16: Tuning for performance
Before final verification of the HFI network, it would be advantageous to tune the HPC software stack to achieve a higher level of performance.
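The following is a minimal sketch of an all-to-all ping test over the HFI interfaces (Step 12), run from the EMS with xdsh. The node group name compute and the -hf0 host-name suffix are assumptions that depend on your naming conventions; netvsp remains the primary stress tool.

# Assumed: node group "compute" and HFI host names of the form <node>-hf0
for src in $(nodels compute); do
  for dst in $(nodels compute); do
    xdsh $src "ping -c 1 ${dst}-hf0 >/dev/null || echo FAIL: cannot reach ${dst}-hf0"
  done
done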

Step 17: Final verification of the HFI network
- Verify the HFI network configuration using the procedure ISNM Network Verification.
- Stress the network using netvsp.
- Stress the network using an all-to-all application.

Step 18: Configuration to support the A+ spare policy
Regardless of operating system, the following are done to configure xcat to help with A+ resource management. Perform these steps as indicated:
- Ensure that A+ data gathering is installed as part of xcat: /opt/xcat/sbin/gatherfip
- As you are configuring your node groups, configure a group to manage A+ resources. Nodes that are considered to be A+ resources should be defined as members of this group.
  mkdef -t group -o Aplus_resources
- As you are configuring your node groups, set up a group to track defective A+ resources:
  mkdef -t group -o Aplus_defective
- If the A+ spare policy is for Hot spares, there is no more that needs to be done.
- If the A+ spare policy is for Cold spares, perform the following after all nodes have been installed, configured and verified. Do not power off a spare node until it has been verified as functional.
  1. Power off all cold spares.
  2. Keep a careful record of all cold spares and when they are used.
- If the A+ spare policy is for Warm spares, perform the following after all nodes have been installed, configured and verified. Do not power off a spare node until it has been verified as functional.
  1. Reboot all warm spares to partition standby.
  2. Keep a careful record of all warm spares and when they are used.

Step 19: Create access for the SSR to the EMS
Create an ID for the SSR to log on to the EMS. If this is not acceptable to the site security policy, an administrator must be asked to perform certain operations that are critical to hardware maintenance, repair and verification. The SSR should be able to perform the following operations and commands:
/usr/bin/lsnwlinkinfo
/usr/bin/lsnwmiswire
/usr/bin/lsnwdownhw
/opt/teal/bin/tllsalert
/opt/teal/bin/tllsevent
xdsh /usr/lpp/mmfs/bin/mmlspdisk all replace
xdsh /usr/lpp/mmfs/bin/mmchcarrier

Software installation ends here. At this point, customer application installation and tuning may begin.
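As a hedged illustration of how the groups created in Step 18 might be used over the life of the cluster, the commands below add a failed node to the Aplus_defective group and list the group's members. The node name f01c01ap05 is a hypothetical example.

# Append a failed node to the defective-resources group (node name is hypothetical)
chdef -p -t node -o f01c01ap05 groups=Aplus_defective

# Review the members of the defective-resources group
nodels Aplus_defective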

9125-F2C system hardware installation

The following tasks must be accomplished to install the 9125-F2C hardware using the 9125-F2C hardware installation manual. Execute this for each frame of 9125-F2C systems. The team performing these installation tasks is comprised of:
1. The contracted movers, who place equipment on the floor.
2. The IBM System Service Representatives (SSRs), who install the 9125-F2C systems.

Before you begin:
- Review all worksheets and information generated during the planning phase.
- Review the installation procedure thoroughly. Verify that you can access all links referenced. For any documents referenced by name only, refer to the References section of this document.
- Ensure that you have reviewed your installation responsibilities and the Installation Terminology section above.
- Verify that the site team has finished preparing enough of the floor for installation of the 9125-F2C cluster. This includes site power.

Tasks:
1. Place the hardware on the floor. Carefully follow all instructions on moving, placing and leveling the frames for the 9125-F2C. Note: This task is performed by the movers. The remaining tasks are performed by the IBM SSRs.
2. Connect the line cords.
3. Connect Ethernet cables for the service and cluster management networks.
4. Fill the water in the frame.
5. Power on the frame.
6. Verify that the 9125-F2C systems have successfully reached LPAR standby:
a) An HMC must be installed and assigned to manage the frame. The HMC assignment is performed by the software installation team. Do not install an HMC with DHCP enabled, as is done with other Power systems.
b) On the HMC, check the boot progress codes for all 9125-F2C systems and verify that they have reached LPAR standby.
c) On the HMC, check Service Focal Point.

The above tasks are given in detail in the 9125-F2C installation guide (ureadds=wciixp&featuresubs=&.cgifields=cuiiserver&.cgifields=outputtype#p7ih_rk_placement).

If you are also responsible for HFI network cabling installation, you may now begin cabling the HFI network to this frame. While it is possible to do this before power is applied to the frame, you will find that you cannot verify the cabling until the power is applied.

HFI network cabling installation

This may be executed by the same team that is installing the 9125-F2C system hardware. However, with larger clusters, separate teams may be more efficient, in which case the HFI cabling installation team will perform the following tasks.

Before you begin:
1) Review all worksheets and information generated during the planning phase that refer to HFI network cabling.
2) Review the installation procedure thoroughly. Verify that you can access all links referenced.
3) For any documents referenced by name only, refer to the References section of this document.
4) Ensure that you have reviewed your installation responsibilities and the Installation Terminology section above.
5) Verify that the site team has finished preparing enough of the floor for installation of the 9125-F2C cluster. This includes site power.

Use the 9125-F2C hardware documentation to understand locations for HFI network connectors and how to handle and plug cables and cable assemblies. The hardware documentation should cover the following tasks:
1. Run cables between the frames.
2. Route and dress the cables in the frames, assuring that all cables are as accessible as possible for future service and that cable tension has been relieved.
3. Plug the cable ends into the HFI network connectors on the 9125-F2C system bulkheads.
4. Verify the cabling as you proceed. Note: This can only be executed once the EMS is installed and CNM can reach and manage all of the BPAs and FSPs for the systems being cabled.
a) Log on to the EMS.
b) Verify that all of the links that have been cabled thus far are up and operational.
Get a count of the number of links that are up and operational, and verify that the number is correct. This should match the number of cables connected to powered-on systems, including LR-links.
lsnwlinkinfo | grep UP_OPERATIONAL | wc -l
If the number of links is not as expected, find the list of links that are not up and operational; run:
lsnwdownhw -a
Repair any down links by replacing cables and running diagnostics as necessary. To find the complete list of up and operational links, run:
lsnwlinkinfo | grep UP_OPERATIONAL
c) Verify that there are no miswires:
lsnwmiswire
In the output of the lsnwmiswire command, actual and expected locations are provided in the ISNM hardware logical location format. Recable as necessary to eliminate the miswires.

Return to step 4.b to verify link connectivity, in case the cables were not properly plugged upon repair.
d) Once all cables have been verified, this sub-procedure is finished.
Repeat steps 1 through 4 until no more frame hardware is available and all cables have been installed and verified.

This procedure ends here.

Note: Location mapping from the ISNM hardware logical location format to the 9125-F2C physical, or service, location format can be found in ISNM location naming conventions.

Note: While this procedure endeavors to verify cabling and network stability as much as possible, application stress tests still need to be run to wring out marginal links. This is done toward the end of the order of software installation. Therefore, it is possible that network diagnosis and repair will have to be performed after this procedure has been completed.

Other Installation Procedures

Configure ISNM

This procedure describes how to configure ISNM to manage the HFI network. This can be done prior to arrival, cabling and power-on of hardware. Once the hardware is ready, use ISNM Network Verification.

Before you begin:
- Review the ISNM configuration procedure in its entirety.
- Ensure that the EMS is installed and that xcat and the DB2 database are ready.
- Ensure that ISNM has been installed.
- Ensure that CNM is not running (Linux: service cnmd status, then service cnmd stop; AIX: /usr/bin/chnwm -d).
  Note: In Linux, chkconfig --list doesn't list cnmd, because it should not be started on boot.
- Ensure that hardware discovery has been performed and xcat can see all of the FSPs and BPAs.

The tasks that you are going to perform are:
- Configure and verify the configuration of the site table information for ISNM.
- Configure frame IDs and supernode information.
- Configure the cluster hardware server to allow CNM access to the FSPs and BPAs.
- Run ISNM commands to customize the configuration of ISNM.

Execute this procedure:
1. Configure the site table: site.topology
a) Indicate the cluster network topology. This is based loosely on the number of D-links into a supernode, for example 8D, 16D, 32D, 128D. Refer to the network planning information for which values to choose. A worksheet should have been filled out ahead of time during the planning phase. This is done in the xcat cluster configuration file. You should have recorded where this is stored, or used the recommended path in /install.

chdef -t site -o mastersite topology=[topology string]
For example: chdef -t site -o mastersite topology=8D

2. Configure the frame IDs in ppc.id and nodetype.nodetype.

3. Configure the supernode number to frame and cage ID mapping in ppc.supernode.

4. Verify that the hardware server is running. Start it if it is not running.
For AIX:
> ps -eaf | grep hdwr_svr
root ... /opt/isnm/hdwr_svr/bin/hdwr_svr
If it is not running, execute: /opt/isnm/hdwr_svr/bin/hdwr_svr
For Linux:
> service hdwr_svr status
hdwr_svr (pid 28631) is running...
If it is not running, execute: service hdwr_svr start

5. Configure the cluster hardware server to allow CNM access to the FSPs and BPAs. These are in addition to the hardware connections previously configured for xcat.
mkhwconn bpa -t -T fnm
mkhwconn fsp -t -T fnm
Verify the hardware connections:
# Check the number of FSP connections. There should be two per CEC drawer.
lshwconn fsp -T fnm | wc -l
# Check the number of BPA connections. There should be two per frame.
lshwconn bpa -T fnm | wc -l
If there is a problem, you should list all of the connections by rerunning the command without piping it into wc -l. The results will look like the following. Pay close attention to both sides of each connection.
> lshwconn fsp -T fnm
f14c01fsp2_a: sp=secondary,ipadd= ,alt_ipadd=unavailable,state=line UP
f14c02fsp2_a: sp=secondary,ipadd= ,alt_ipadd=unavailable,state=line UP
f14c01fsp1_a: sp=primary,ipadd= ,alt_ipadd=unavailable,state=line UP
f14c02fsp1_a: sp=primary,ipadd= ,alt_ipadd=unavailable,state=line UP
> lshwconn bpa -T fnm
f07c00bpca_a: side=a,ipadd= ,alt_ipadd=unavailable,state=line UP
: side=b,ipadd= ,alt_ipadd=unavailable,state=line UP
: side=a,ipadd= ,alt_ipadd=unavailable,state=line UP
f07c00bpcb_a: side=b,ipadd= ,alt_ipadd=unavailable,state=line UP

Rerun the mkhwconn command associated with the failing tooltype given in the -T parameter. If the failure persists, you can list what is in the cluster hardware server configuration file using the following. It may be necessary to call your next level of support for further debug.
grep fnm /var/opt/isnm/hdwr_svr/data/hmcnetconfig
The output will look like the following. For BPA connections, the entry starts with BPC instead of FSP.
FSP tooltype=fnm -netc=y -authtok=<authentication token> -authtoklastupdtimestamp=<timestamp> -mtms=9125-f2c*0283f56 -slot=a -masterslavecounter=0 -ignorepwdchg=n
6. Stop CNM, if it is running.
Check whether it is running (the same check applies to AIX and Linux):
ps -ef | grep cnmd
To stop it on AIX: /usr/bin/chnwm -d
To stop it on Linux: service cnmd stop
7. Verify that all servers are powered down to FSP standby (power off):
rpower cec stat
All servers should report that they are powered off. Also run lsnwloc; any location reported in RUNTIME_CNM_EXCLUDED state needs to be brought down to FSP standby:
rpower cec off
8. Start CNM.
Using AIX: /usr/bin/chnwm -a
Using Linux: service cnmd start
9. Verify that it is running:
> service cnmd status
cnmd (pid 6908) is running...
Store the server configuration data in system firmware on all servers. This is placed in persistent store that remains until the firmware is updated.
Note: For more information on ISNM commands, see the ISNM Command section in this document.

a) Verify the state of the system firmware as seen by CNM. Run:
lsnwloc | grep STANDBY_CNM_EXCLUDED
b) If any instance has a status of STANDBY_CNM_EXCLUDED, run the following to store the configuration data in the system firmware. Otherwise, skip to the next step.
chnwsvrconfig -A
c) Wait several minutes, then run:
lsnwloc | grep -v STANDBY
(STANDBY is the state we want.) If any result returns with STANDBY_CNM_EXCLUDED, wait several minutes and try again. If this persists, reissue the chnwsvrconfig -A command one more time. If the problem still persists, call your next level of support. Every location should be at STANDBY.
d) Ensure that the Cluster Database, CNM, and all instances of system firmware have the same topology information:
1. Cluster Database topology: lsnwtopo
2. CNM topology: lsnwtopo -C
3. System firmware: lsnwtopo --all | grep -v <topology from CNM>
For example: lsnwtopo --all | grep -v 8D
If CNM does not match the Cluster Database topology, and this is the first time you have used this procedure, go back to the step that stops CNM and try the procedure again. If this problem persists, call your next level of support.
If any system firmware instance does not match the CNM topology, it will be returned by the grep -v. Go back to the step that stores the configuration data in system firmware and repeat the procedure. If this persists, call your next level of support.
10. You may now power on the CECs:
rpower cec on
11. This procedure ends here. Return to whichever procedure or documentation referenced this procedure.

Using ISNM to verify the HFI network
The following describes the procedure to verify the HFI network using ISNM. Its purpose is to assess the stability of the network. It may be repeated multiple times and as a sub-task of other procedures:
- Initial HFI network hardware and cabling integration
- Before and after stress tests
- After repairing a link
The following describes, at a high level, the verification steps for the HFI network using ISNM:
- Verify the CNM configuration of performance counter tracking, routing, recovery behavior, and RMC interface, as well as site information for the topology and for the event analysis reporting HMC.
- Verify that all of the BPAs, FSPs, and links are recognized by CNM.
- Check the state of the links

- Search for faulty link, HFI, or ISR hardware
- Search for miswires
- Verify that all ISNM functions agree on the same topology information.

Step 1: Verify CNM Configuration:
Verify that all of the 9125-F2C systems are managed by CNM and that all supernode and drawer assignments are as expected.
> lsnwloc
Example output:
FR008-CG09-SN001-DR2 RUNTIME
FR008-CG08-SN001-DR1 RUNTIME
FR008-CG11-SN002-DR0 RUNTIME
FR008-CG04-SN000-DR1 RUNTIME
FR008-CG13-SN002-DR2 RUNTIME
FR008-CG07-SN001-DR0 RUNTIME
FR008-CG14-SN002-DR3 RUNTIME
FR008-CG03-SN000-DR0 RUNTIME
FR008-CG10-SN001-DR3 RUNTIME
FR008-CG06-SN000-DR3 RUNTIME
FR008-CG05-SN000-DR2 RUNTIME
FR008-CG12-SN002-DR1 RUNTIME
> lsnwloc | wc -l
The result should equal the total number of CEC drawers in your cluster.
If a problem is found in the above, check the following:
- Verify that CNM is running: ps -eaf | grep cnmd | grep -v grep
- Verify that db2 is running: ps -eaf | grep db2 | grep -v grep
- Verify that xcat is running: ps -eaf | grep xcat | grep -v grep
- Verify that the xcat ppc table is filled out properly.
- Check the CNM /var/opt/isnm/cnm/log/eventsummary.log for strings like the following:
Unable to Connect to XCATDB
Unable to get configuration parameters from isnm_config table
Unable to get cluster information from XCATDB
MISMATCH: Expected frame 0 cage 0 supernode 0 drawer 0 topology 0 numdlink 0, but received frame 14
Verify that the ISNM configuration for performance counter tracking, recovery behavior, and RMC interface is as expected:
> lsnwconfig

ISNM Configuration parameter values from Cluster Database
CNM Expired Records Timer Check: 3600 seconds
Hardware Indirect Routing Scheme: Round-Robin (0)
RMC Monitoring Support: ON (1)
No. of Previous Performance Summary Data: 1
Performance Data Interval: seconds
Performance Data Collection Save Period: 168 hours
CNM Recovery Consolidation Timer: 300 seconds
CNM Summary Data Timer: seconds
Verify that the topology specifier is as expected and that it matches the number of D-links per supernode, and verify that the primary and secondary reporting Event Analysis HMCs are as expected:
> tabdump site | /usr/bin/grep "ea.*hmc"
"ea_primary_hmc","c250hmc05.ppd.pok.ibm.com",,
"ea_secondary_hmc","c250hmc06.ppd.pok.ibm.com",,

Step 2: Verify all network hardware recognized by CNM:
Check the number of BPAs. There should be 2 returned for every frame that is powered on: one for the A-side BPA and one for the B-side BPA.
lsnwcomponents | grep BPA | wc -l
If the number of BPAs doesn't match twice the number of frames, then one or more BPA connections are missing. Use the following to dump the list of BPAs and look for the missing one. It may be that an A side or a B side is missing. As long as one is available, the network will be okay; however, the problem should be fixed.
lsnwcomponents | grep BPA
Check the number of FSPs. There should be 2 for every 9125-F2C system that is powered on: one for the A-side FSP and one for the B-side FSP.
lsnwcomponents | grep FSP | wc -l
If the number of FSPs doesn't match twice the number of 9125-F2C systems, then one or more FSP connections are missing. Use the following to dump the list of FSPs and look for the missing one. It may be that an A side or a B side is missing. As long as one is available, the network will be okay; however, the problem should be fixed. Because the number of BPAs was previously verified, the problem is most likely between the 9125-F2C system and the BPA, or the 9125-F2C is powered off.
lsnwcomponents | grep FSP
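The two count checks above are easy to script so that they can be rerun after every hardware action. The following is a minimal sketch; the frame count and powered-on system count are operator-supplied inputs, and only the lsnwcomponents invocations described above are used.

  #!/bin/sh
  # Usage: check_isnm_counts <number of frames> <number of powered-on 9125-F2C systems>
  FRAMES=$1
  CECS=$2
  BPAS=$(lsnwcomponents | grep BPA | wc -l)
  FSPS=$(lsnwcomponents | grep FSP | wc -l)
  echo "BPAs reported by CNM: $BPAS (expected $((FRAMES * 2)))"
  echo "FSPs reported by CNM: $FSPS (expected $((CECS * 2)))"
  # When a count is short, dump the full list so the missing A or B side can be found.
  [ $BPAS -ne $((FRAMES * 2)) ] && lsnwcomponents | grep BPA
  [ $FSPS -ne $((CECS * 2)) ] && lsnwcomponents | grep FSP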

Check that all of the 9125-F2C systems are managed by CNM. There should be one for every 9125-F2C that is powered on.
lsnwloc | wc -l
If the number of 9125-F2C systems doesn't match the expected number, use the following to dump the list of 9125-F2C system locations and look for the missing one. Because the BPA and FSP connections were verified previously, there may be a firmware problem or a problem in CNM preventing the recognition of the system.
lsnwloc
Verify that none of the lines come back with STANDBY_CNM_EXCLUDED status, for example:
FR001-CG05-SN001-DR0 STANDBY_CNM_EXCLUDED
If any do, wait several minutes and try again. If this persists, call your next level of support.
If any actions are taken on the hardware, return to the beginning of Step 2 of this procedure to ensure that nothing critical was disturbed or broken during the repair process.

Step 3: Verify topology:
This verifies that the copies of the topology information in CNM memory, in the Cluster Database, and in all of the instances of the ISNM service processor component agree.
Get the topology that is specified in the cluster database:
> lsnwtopo
ISR network topology specified by cluster configuration data is 8D
If a problem is found, check the following:
- Verify that CNM is running: ps -eaf | grep cnmd | grep -v grep
- Verify that db2 is running.
- Verify that the xcat ppc table is filled out properly.
- Check the CNM /var/opt/isnm/cnm/log/eventsummary.log for strings like the following:
Unable to Connect to XCATDB
Unable to get configuration parameters from isnm_config table
Unable to get cluster information from XCATDB
MISMATCH: Expected frame 0 cage 0 supernode 0 drawer 0 topology 0 numdlink 0, but received frame 14
Get the topology that CNM is using and verify that it matches the topology in the cluster database. If there is a mismatch, call your next level of support.
> lsnwtopo -C
Get the topology that has been pushed out to the servers:
> lsnwtopo -A

Frame <n> Cage 11 : Topology 8D, Supernode 49, Drawer 0
Frame <n> Cage 5 : Topology 8D, Supernode 2, Drawer 0
Frame <n> Cage 8 : Topology 8D, Supernode 32, Drawer 0
Frame <n> Cage 10 : Topology 8D, Supernode 48, Drawer 0
Frame <n> Cage 7 : Topology 8D, Supernode 17, Drawer 0
Frame <n> Cage 3 : Topology 8D, Supernode 0, Drawer 0
Frame <n> Cage 4 : Topology 8D, Supernode 1, Drawer 0
If a problem is found with the topologies that are pushed out to the servers, you can try to recover using the chnwsvrconfig command documented in the High perfomance clustering using the 9125-F2C Management Guide. Other mismatch problems are documented in the High perfomance clustering using the 9125-F2C Service Guide under lsnwtopo problems.

Step 4: Search for faulty network hardware:
The ability to recognize all of the network hardware has been verified. Now, look for failures.
If all cables are connected, use the following command:
lsnwdownhw | grep -v DOWN_POWEROFF
If some cables are disconnected, use the following command:
lsnwdownhw | egrep -v "DOWN_NBRNOTINSTALLED|DOWN_POWEROFF"
If anything returns, perform the recovery actions listed for the status, which is the last field returned. The recovery actions are listed in Network management (ISNM) command reference (in this document) under the lsnwlinkinfo command. If a DOWN_MISWIRE exists, you may use the instructions in Step 6 to help resolve the problem.
If any actions are taken on the hardware, return to the beginning of Step 2 of this procedure to ensure that nothing critical was disturbed or broken during the repair process.

Step 5: Check the state of the links:
Check the number of links reported as operational by CNM. There should be one for every D-link and LR-link that is connected and operational. If frames or 9125-F2C systems are powered off, then the number of operational links will be decreased accordingly.
Check the number of D-links:
lsnwlinkinfo | grep UP_OPERATIONAL | egrep "D[0-9]" | wc -l
The count is typically [number of CEC drawers] * [number of CEC drawers - 1] * [D-link topology], where:
number of CEC drawers = lsnwcomponents | grep "FSP Primary" | wc -l
D-link topology = lsnwtopo
For example, 3 supernodes using an 8D topology would be:

3 * (3 - 1) * 8 = 48
For 8 supernodes using an 8D topology:
8 * (8 - 1) * 8 = 448
If you have fewer than the maximum number of D-links connected for a topology, you need to adjust the D-link topology number to match the number of D-links attached to each CEC drawer.
Check the number of LR-links:
lsnwlinkinfo | grep UP_OPERATIONAL | egrep "LR[0-9]" | wc -l
The count should be 192 * [number of CEC drawers], where:
number of CEC drawers = lsnwcomponents | grep "FSP Primary" | wc -l
If the number of operational links does not match the expected number of links, use the following to dump the list of operational links and look for the missing ones:
lsnwlinkinfo | grep UP_OPERATIONAL
Or, use the following to search for known links that are not UP_OPERATIONAL:
lsnwlinkinfo | grep -v UP_OPERATIONAL
Or, use the following to get all of the known links and then, looking through the output, determine which ones are missing:
lsnwlinkinfo
If any actions are taken on the hardware, return to the beginning of Step 2 of this procedure to ensure that nothing critical was disturbed or broken during the repair process.

Step 6: Search for miswires:
To search for miswires:
lsnwmiswire
The actual and expected neighbors are listed. Search for situations where cables are swapped. For example, copy the expected neighbor into the following command. If only one line returns, then there is not a cable swap, and the cable end must be moved from the actual neighbor to the expected neighbor.
lsnwmiswire | grep [expected neighbor]
Once all of the miswires have been fixed, return to the beginning of Step 2 to ensure that nothing critical was disturbed or broken during the repair process.
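The expected-link arithmetic in Step 5 can be scripted so the comparison is repeatable. The following is a minimal sketch that assumes a single drawer per supernode (so that the CEC drawer count from lsnwcomponents can be used directly in the D-link formula); supply the D-link topology value yourself, and reduce it if fewer than the maximum number of D-links are cabled per drawer.

  #!/bin/sh
  # Usage: check_link_counts <D-link topology, e.g. 8 for 8D>
  TOPO=$1
  DRAWERS=$(lsnwcomponents | grep "FSP Primary" | wc -l)
  DLINKS=$(lsnwlinkinfo | grep UP_OPERATIONAL | egrep "D[0-9]" | wc -l)
  LRLINKS=$(lsnwlinkinfo | grep UP_OPERATIONAL | egrep "LR[0-9]" | wc -l)
  EXP_D=$(( DRAWERS * (DRAWERS - 1) * TOPO ))
  EXP_LR=$(( DRAWERS * 192 ))
  echo "Operational D-links:  $DLINKS (expected $EXP_D)"
  echo "Operational LR-links: $LRLINKS (expected $EXP_LR)"
  # If either count is short, list the links that are not operational:
  #   lsnwlinkinfo | grep -v UP_OPERATIONAL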

Step 7: Verify that the hardware global counter is configured:
Check that a global counter master has been set up:
# lsnwgc
Verify that a valid Global Counter Master location is displayed. For clusters with fewer than <N> 9125-F2C systems, the number of configured backups should be equal to (number of CEC drawers - 1). As long as a Global Counter Master is active and there are no hardware issues, the hardware global counter will be synchronized. Problems such as garded ISRs, too many drawers powered down, and too many faulty links can prevent global counter propagation to some of the octants, particularly in single-drawer supernode configurations. If the number of configured backups is fewer than expected, the global counter will still function.
This procedure ends here. Return to the procedure or document that referenced this procedure.

Diskless Node Logging Configuration
With diskless nodes, logging must be considered carefully. To be able to diagnose problems when nodes crash, some of the critical HPC product log files should be put in a global file system so that they are available even after the node is down. Another consideration is that some of the log files can drive a large amount of file system traffic, especially when coming from all of the compute nodes at around the same time. Therefore, you need to make sure the global file system chosen can handle the traffic as well as the volume. A few suggestions for where to put them are provided below, in order of preference:
1. A GPFS filesystem - except for the GPFS log files themselves
2. An external NFS server sized appropriately to handle the traffic. Consider the network bandwidth, the CPUs, the memory, and the number of hard disks.
3. The service node - if you are going to log all HPC log files to the service node, you will need more internal drives to handle the bandwidth.
For using NFS to make logs, traces, and dumps persistent, see Persistent logs using NFS, on page 113. In addition, there is a list of references in References for Logging with Diskless Nodes, on page 114.
The following table lists the relevant HPC products and describes the log files, how you enable/disable them (if applicable), and how you change their logging location to suit your needs. Also included is who is responsible for configuring a particular log or set of logs: the system administrator (Admin) or the end user (User). The HPC products not listed in the table do not have logs on diskless nodes. The major task will be to point the logs to the global filesystem that you have chosen above. For more detail on each component's logging, see the component's reference documentation. In addition, you will want to refer to the documentation listed below the table.
In addition to the HPC product log files, consideration must be given to operating system log files and dumps. For information on AIX, see xcat AIX Diskless node images (reference below) - especially the sections on: Preserving system log files, ISCSI dump support, and Preserving AIX ODM configuration data on diskless

112 stateless nodes.. Table 13: Diskless Logging Files HPC component PMD POE Coscheduler PNSD SCI HPC Toolkit Files Defa Enable/Disable Changing location ult / Off MP_PMDLOG=yes MP_PMDLOG_DIR=<file> tmp/mplog.<jobid>.<taskid> Or Or Invoke poe with "Invoe poe with "-pmdlog_dir <file>" pmdlog yes" /tmp/pmadjpri Off MP_PRIORITY set and MP_PRIORITY_LOG_DIR ormp_priority_log= priority_log_dir MP_PRIORITY_LOG_NAME or yes -priority_log_name /tmp/serverlog rotated to On Always on /etc/pnsd.cfg has variable to change /tmp/serverlog.old at log file name and max size 10MB / Off Requires a debug SCI_LOG_DIRECTORY for the fe, be tmp/<hostname>.fe.log.< version lib* or scia logs pid> The environment variable only works / with a debug version. tmp/<hostname>.be.log. SCI_LOG_LEVEL affects the logging <pid> level** / tmp/<hostname>.scia.log.<pid> Binary Instrumentation: Off HPCTLOG=[1-4] Use HPC_TEMPDIR to change the /tmp/[hpctinst].bas.log logging directory Responsible Admin User User User User Hardware counter profiling /tmp/[hpm.<pid>.log LoadLeveler OpenMP profiling /tmp/pomp<mpi_rank>. log For details, see Configuring Logs: $(tilde)/log/* Logs Archive: no default recording activity and log files in TWS LoadLeveler: job queue: $(tilde)/spool Using and Administering Executables: $ (tilde)/execute For what is defaulted on/off see information on D_ALWAYS. LOG=<directory> changes the directory Admin to which logs are saved SAVELOGS=<direcotry> changes directory to which logs are archived SPOOL=<directory> changes directory to which job queue is saved EXECUTE=<directory> changes directory to which executables are saved For other log control information see LoadLeveler documentation. p. 112 of 135

GPFS: Files: /var/mmfs/* and /var/log/adm/ras/gpfslog/*. Default: On. For enabling/disabling and changing the location, see the GPFS documentation for Problem Determination and the xcat AIX diskless nodes wiki page (index.php?title=xcat_aix_diskless_nodes). Responsible: Admin.
Note: GPFS should log persistent files to an NFS server and should not use the GPFS filesystem.
* The debug version of the SCI library will be included in the release. On AIX, the debug version of the library is located in /opt/ibmhpc/${ppe_path}/ppe.sci/libdebug/. The /opt/ibmhpc/${ppe_path} is the new installation path of the PE Runtime suite, which includes a version/release identifier as part of the installation path (for example, /opt/ibmhpc/pe1100). The open-source compiled library is the debug version.
** SCI_LOG_LEVEL: the level of the logs. This environment variable can only work with the debug version. The valid values are: 0 (CRITICAL), 1 (ERROR), 2 (WARNING), 3 (INFORMATION), 4 (DEBUG), 5 (PERFORMANCE), 6 (OTHER). Level 0 is the most critical one. The default value is 3 (INFORMATION).

Persistent logs using NFS
When using statelite on diskless nodes, two tables are typically used for controlling the files that you want to be persistent and that should not be logged to GPFS:
- statelite is the table that points to the NFS server where the persistent data is stored and mounted during a reboot.
- litefile is the table that lists the files or directories that are read/write or persistent.
The following is an example of the litefile table, listing the files you want to be persistent that require NFS service rather than GPFS service.
#image,file,options,comments,disable
"ALL","/etc/basecust","persistent",,
"ALL","/var/adm/ras/errlog","persistent",,
"ALL","/var/adm/ras/gpfslog/","persistent",,
"ALL","/gpfslog/","persistent",,
"ALL","/var/mmfs/","persistent",,
"ALL","/var/spool/cron/","persistent",,
/etc/basecust = file with records of updates to the ODM; used to restore the ODM
/gpfslog/ = a user-defined directory for storing GPFS traces
/var/adm/ras/gpfslog/ = the "/" at the end of the entry indicates it is a directory; all files in that directory will be stored
/var/adm/ras/errlog = the AIX error log
/var/mmfs/ = the directory containing the GPFS configuration for each node
/var/spool/cron/ = cron jobs
Note: Do not mount the entire /var/adm/ras directory in the statelite environment. NFS can hang when /var/adm/ras/conslog is mounted. Only mount the required files in /var/adm/ras.
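After the statelite and litefile tables are populated and a node has been rebooted, it is worth spot-checking that the persistent entries really are backed by the NFS server rather than by the memory-resident root. The following is a minimal sketch run from the EMS or a service node; it assumes xdsh access to the node, uses only the paths from the example above, and some entries (such as /etc/basecust) apply only to GPFS server images.

  #!/bin/sh
  # Usage: check_statelite <node name known to xCAT>
  NODE=$1
  for path in /etc/basecust /var/adm/ras/errlog /var/adm/ras/gpfslog /gpfslog /var/mmfs /var/spool/cron
  do
      echo "== $NODE: $path =="
      # A persistent statelite entry normally shows up in the node's mount table.
      xdsh $NODE "mount | grep $path || echo '$path is not individually mounted'"
  done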

114 For more information on diskless nodes with the AIX operating system, see References for Logging with Diskless Nodes Table 14: Diskless Node Logging References Reference xcat Linux Statelite nodes Links/Comments title=xcat_linux_statelite xcat AIX Diskless node images ndex.php?title=xcat_aix_diskless_nodes Parallel Environment Runtime Edition for AIX Look for logging information for each PE component. PMD, Operation and Use POE Co-scheduler, PNSD and SCI Setting up LoadLeveler in Statelite or ndex.php? Stateless Clusters title=setting_up_loadleveler_in_a_statelite_ or_stateless_cluster#add_loadleveler_to_your _diskless_image_2 IBM Tivoli Workload Scheduler LoadLeveler Especially the section on Configuring recording activity and log files. Using and Administering General Parallel File System Problem Determination Guide Setting up GPFS in a Statelite or Stateless Cluster Available via: Especially the section on Logs, traces and dumps. Available via: om.ibm.cluster.infocenter.doc/infocenter.html title=setting_up_gpfs_in_a_statelite_or_stateless_cluste r Placement of LoadLeveler and TEAL GPFS Daemons There are several things to consider in deciding where to run these daemons in a p775 cluster. They have to do with resources that these daemons require and not conflicting with other services that many be running in your cluster. The following sub-sections provide information on placing LoadLeveler daemons (LoadLeveler Daemons, on page 114), TEAL daemons for GPFS (TEAL GPFS Monitoring Connector, on page 115), and how high availability service nodes impact the decision on placement (Highly Available Service Nodes, on page 115) LoadLeveler Daemons Here are the LoadLeveler daemons that you need to place: Central manager and a backup central manager Resource manager and a backup resource manager p. 114 of 135

115 N number of schedd's, when N depends on the rate of jobs typically submitted All of these LoadLeveler daemons can use the cluster database for configuration information or can use traditional configuration files. While use of the cluster database is recommended, to decide whether you are going to use the cluster database for LoadLeveler or not, see the following list of pros and cons: Pros : Automation of setup and configuration changes in one central location LoadLeveler Event information can be made available from the data base via TEAL Rolling update of the cluster software is supported The llconfig command can be used to change/view configuration data Cons: Additional step in installation and set up of LoadLeveler The LoadLeverler daemons listed above must be run on nodes that have access to the database. Currently, this means that the nodes must be diskful and be connected directly to the management ethernet LAN. (All of these LoadLeveler daemons also need to be directly on the HFI network.) There is an additional layer of software to be considered when debugging The schedd's also need to be on nodes that are part of the GPFS application data cluster (or some other global file system) to support the movespool function and checkpoint/restart. TEAL GPFS Monitoring Connector For TEAL, the tlgpfsmon daemon (installed as part of the teal.gpfs-sn package) needs to run on one of the GPFS monitoring collector nodes. It also needs access to the cluster database. Highly Available Service Nodes Service nodes that are not configured to be highly available can be a logical place to run the LoadLeveler and tlgpfsmon daemons, because they are diskful, connected to the management ethernet LAN, and have access to the cluster database. But, if you are running highly available service nodes, the service nodes must be in a GPFS cluster that is separate from the application data GPFS cluster. If this is your case, you can not run the LoadLeveler and tlgpfsmon daemons on the service nodes, because the service nodes can't be in 2 GPFS clusters at the same time. Instead, you will need to carve out some "utility" nodes, likely from the same octants as some of the service nodes. For the utility nodes that are running the LoadLeveler daemons, if LoadLeveler is using plain configuration files, then those utility nodes don't need to have disks and ethernet adapters. They only need to be part of the GPFS application data cluster and be connected to the HFI network. For the utility nodes that are running tlgpfsmon, they must be GPFS monitoring collector nodes and be connected to the cluster database (and therefore have disks and ethernet adapters). p. 115 of 135
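Before committing to a placement, it can help to confirm which GPFS cluster each candidate node actually belongs to, so that daemons that must live in the application data cluster are not placed on nodes that are in the separate service-node cluster. The following is a minimal sketch; the node names are placeholders, and it assumes passwordless root ssh and that GPFS is installed in the default /usr/lpp/mmfs path on those nodes.

  #!/bin/sh
  # Print the GPFS cluster name seen by each candidate daemon node.
  for node in utilnode1 utilnode2 servicenode1
  do
      echo "== $node =="
      ssh $node "/usr/lpp/mmfs/bin/mmlscluster | grep 'GPFS cluster name'"
  done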

Barrier Sync Register (BSR) Configuration
The system administrator assigns privileges to users to allow them to leverage the BSR. This is done on compute nodes. Only the root user can enable BSR privileges for a user.
Enabling BSR on AIX compute nodes
In order to use the BSR, a non-root user has to have the capability to use CAP_BYPASS_RAC_VMM and CAP_PROPAGATE; CAP_NUMA_ATTACH is also recommended for other performance features (for example, to give users access to rset creation and large pages).
To check the BSR capability, use: lsuser <user_id>
To enable the BSR capability, use: chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE,CAP_NUMA_ATTACH <user_id>
Enabling BSR on POWER Linux compute nodes
In order to use the BSR, a non-root user must be given permission via the bsr group, and the group must be an owner of /dev/bsr* and /var/lib/bsr. Starting from PE 1143a, libbsr version 0.5 is needed, because it contains a fix for an mmap failure. Use the following procedure:
1. Assure that libbsr is installed: rpm -qa | grep libbsr
2. Check whether there is a group named "bsr": cat /etc/group | grep bsr
3. If the bsr group does not exist, create it: groupadd bsr
4. Verify that the user is in the bsr group: id <user_id>
5. Verify that /dev/bsr* is owned by root:bsr: ls -l /dev/bsr*
6. If /dev/bsr* is not owned by root:bsr, modify the ownership: chown root:bsr /dev/bsr*
7. Verify that /var/lib/bsr exists: ls /var/lib/bsr
8. If /var/lib/bsr does not exist: mkdir -p /var/lib/bsr
9. Modify the owner of /var/lib/bsr: chown root:bsr /var/lib/bsr
10. Modify the mode of /var/lib/bsr: chmod g+sw /var/lib/bsr

Performance tuning
This topic provides information about performance tuning. The areas that will be addressed are operating system tunables (AIX or Linux), Parallel Environment variables, and GPFS tunables.
Most parameters operate at the AIX level, and are set either in the diskless boot image, by an xcat postscript, or in the software component (e.g., GPFS, LL) configuration. The exception is the memory interleave setting, which is accomplished through PHYP.

Some parameters depend on the intended function of the LPAR or CEC. There are 4 node types/functions:
- GPFS server node. Contains SAS HBAs connected to one or more P7 Disk Enclosures.
- Compute node. Generic application node and GPFS client.
- Login node. GPFS client where users log in and launch jobs.
- Service node. NFS server that provides the diskless nodes with boot images, paging space, and statelite resources.
Some of the parameters depend on the size of the p775 cluster as defined by the total number of HFI interfaces on a common HFI fabric. This is less obvious when you have multiple GPFS clusters defined on a single HFI fabric. In that case, you must account for all of the ml0 interfaces for all of the GPFS nodes that share the same HFI fabric, regardless of how many GPFS clusters are defined.
Several of the parameters are interdependent, so care must be taken to balance them. One example of this interdependency encompasses the settings for rpoolsize, spoolsize, the TCP socket buffer sizes both in the network ("no") options and in GPFS, and the number of large pages. Refer to the performance planning worksheets.

AIX tunables
This topic provides information about AIX tunables.
To support larger jobs, run:
chdev -l sys0 -a maxuproc=512
If you are using the AIX run-queue coscheduler:
schedo -p -o shed_primrunq_mload=0
Because processor folding can degrade performance, run the following to disable processor folding:
schedo -r -o vpm_fold_policy=0
To allow shared memory segments to be pinned:
vmo -r -o v_pinshm=1
If users request large pages, run the following:
vmo -r -o lgpg_size=16777216 -o lgpg_regions=X
To disable enhanced memory affinity and quiet down the VMM daemon, run:
vmo -r -o enhanced_memory_affinity=0
To enable aggressive hardware prefetch, which can improve network bandwidth, run:
dscrctl -s 0x1e -b
To choose SMT2, the preferred mode for HPC, run:
smtctl -t 2 -w boot
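These AIX tunables are typically collected into a single xcat postscript (or applied with xcatchroot inside the boot image) so that every node of a given type receives the same settings. The following is a minimal sketch that simply strings together the commands listed above; the large-page region count is a placeholder that must come from your performance planning worksheet, and commands that change the boot image on diskful nodes still require the bosboot step described later in this section.

  #!/bin/ksh
  # Sketch of a postscript applying the AIX tunables from this section.
  LGPG_REGIONS=${LGPG_REGIONS:-0}     # placeholder: size from the planning worksheet

  chdev -l sys0 -a maxuproc=512                  # support larger jobs
  schedo -p -o shed_primrunq_mload=0             # AIX run-queue coscheduler (if used)
  schedo -r -o vpm_fold_policy=0                 # disable processor folding
  vmo -r -o v_pinshm=1                           # allow pinned shared memory
  if [ "$LGPG_REGIONS" -gt 0 ]; then
      vmo -r -o lgpg_size=16777216 -o lgpg_regions=$LGPG_REGIONS   # 16 MiB large pages
  fi
  vmo -r -o enhanced_memory_affinity=0           # quiet the VMM daemon
  dscrctl -s 0x1e -b                             # aggressive hardware prefetch
  smtctl -t 2 -w boot                            # SMT2, preferred for HPC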

Linux Tunables
This topic provides information about Linux tunables.

Parallel environment tunables
This topic provides information about Parallel Environment tunables. The effect of Parallel Environment environment variables on performance varies depending on the application. Typically, some experimentation must occur to find the best combination of settings. While cluster installation is not the time to configure individual application parameters, there is typically a period of testing cluster function and performance. During that time, much can be learned about the performance of a particular configuration, and sets of environment variable suggestions can be created based on classes of applications. For more information, see {Management section on performance and Parallel Environment environment variables}.

p775 Hardware Memory Controller Setting
The p775 hardware memory controller setting is configured through the FSP. How??? GPFS server nodes need to be set in 8MC mode. All other node types need to be set in 2MC mode. These values are set using:
PendingMemoryInterleaveMode=1 # for NSD/GPFS nodes (8MC)
PendingMemoryInterleaveMode=2 # for all other nodes (2MC)

Boot Image tuning for AIX
Check this relative to xcat documentation.
The boot images are configured in the EMS/service node NIM configuration. Two boot images are required: one for GPFS servers and one for compute/login nodes. Service nodes are diskful. Because GPFS server and GPFS client nodes need different boot-time vmo, no, and other parameter settings, and because they require different statelite resources, the GPFS server and GPFS client nodes require different boot images. Need to define vmo and no.
The GPFS servers need a persistent AIX ODM (the file /etc/basecust) to preserve customized device attributes. It is recommended that compute or other diskless nodes not have a persistent ODM. They should boot with default device attributes. Otherwise, strange situations might arise where what should be interchangeable, anonymous compute nodes exhibit different behaviors, because a customized device attribute was changed in one but not in others.

GPFS server LPAR boot image only: the file /etc/basecust
GPFS server and client/compute LPARs: the file /var/adm/ras/errlog
GPFS server and client/compute LPARs: the directory /var/mmfs
GPFS server and client/compute LPARs: the directory /var/spool/cron
What's this list of files??
Only the GPFS server LPAR boot image should have the file /etc/basecust as persistent statelite. The interchangeable, generic compute nodes should not be able to individually preserve AIX ODM attributes.

Tuning NFS Paging Space with AIX
The NFS paging space is configured in the EMS/service node NIM configuration. The default diskless boot image paging space of 64M is insufficient for a GPFS server. The following are reasons to increase the default service node NFS-based paging space:
- GPFS servers can pin at least 48 to 64 GB of pagepool and require temporary real swap space to stage the real memory pins.
- Some benchmark applications use real swap space to run.
- The stnfs filesystem mount of the shared root directory will allocate pages for any runtime changes to files in the shared root (for example, /tmp).
Therefore, it is recommended that all nodes be configured with 1024M of paging space. In addition, be sure that there is enough space on the service nodes to accommodate all the required swap space. This is especially important when reassigning service nodes for failover.

Diskful node
For any command in this document which changes the boot image (e.g., no -r -o arptab_nb and the various vmo -r commands), the bosboot command is also required to effect the modification of the boot image. For example:
bosboot -a -d /dev/ipldevice

HFI rpoolsize and spoolsize in AIX
The HFI rpoolsize and spoolsize are configured using an xcat postscript. For GPFS servers and service nodes, the maximum value of 512 MiB (0x20000000, hex 2 followed by 7 zeros) should be used. For GPFS client compute and login nodes, the rpoolsize and spoolsize value depends on the GPFS/TCP socket buffer size and the number of GPFS server LPARs that the GPFS client must communicate with. For GPFS clients, the HFI rpoolsize and spoolsize should be:

poolsize = MINIMUM OF (i) and (ii)
(i) Number of GPFS servers * socket buffer size
(ii) 0x... (each client must have at least 64 MB)
The above value is rounded to the nearest MiB, expressed in hex. If the value of (i) is less than 64 MiB, use the minimum value defined in (ii), 64 MiB (0x4000000).
Note that this provides a cushion for any non-GPFS sockets which use the rpool/spool, because in general the rpool/spool may be overcommitted by 2x.
For example, with 3 GPFS servers and a socket buffer of 4198400 bytes, the calculation results in an rpool/spool of 3 * 4198400 = 12595200 bytes, or about 12 MiB. This is less than 64 MiB, so the default of 64 MiB (0x4000000) should be used for both the HFI rpoolsize and spoolsize on the GPFS client nodes.
As another example, with 10 GPFS servers and a socket buffer of 4198400 bytes, the calculation yields an rpool/spool of 10 * 4198400 = 41984000 bytes, or about 40 MiB. This is greater than 32 MiB, so a non-default value of approximately 40 MiB (0x280A000) should be used for both the HFI rpoolsize and spoolsize on the GPFS client nodes.
The HFI rpoolsize and spoolsize are set at boot time by an xcat postscript using the following commands:
SERVER: /usr/lib/methods/chghfi -l hfi0 -a rpoolsize=0x20000000
SERVER: /usr/lib/methods/chghfi -l hfi0 -a spoolsize=0x20000000
CLIENT: /usr/lib/methods/chghfi -l hfi0 -a rpoolsize=<value calculated above>
CLIENT: /usr/lib/methods/chghfi -l hfi0 -a spoolsize=<value calculated above>
Recall that service nodes should have the maximum pool values, just like GPFS servers. But service nodes are diskful, so the chghfi commands should only need to be executed once and will be preserved in the ODM.
The HFI receive and send pools are backed by system VMM large pages, so a minimum number of 16 MiB large pages to cover both pools is required, plus a cushion for other system programs and for those that might be required by user applications. See the "vmo" section.

Network ("no") Tunables
Some of these must, and most should, be set using xcatchroot in the boot images.
ARP table values need to be set in the boot image, as they cannot be changed at runtime. ARP table values are the same in all boot images and on the service node. The number of buckets (arptab_nb) is always 149 (a reasonably large prime number for the hash table). The size of each bucket (arptab_bsiz) depends on the total number of nodes on the HFI fabric.
no -r -o arptab_nb=149 -o arptab_bsiz=X
where X is (8 * number of LPARs on the HFI) / 149, with a minimum value of 9.
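The arptab and client pool-size arithmetic above lends itself to a small helper script. The following is a minimal sketch; the LPAR count, GPFS server count, and socket buffer size are inputs that must come from your own planning worksheets, and the printed values should be cross-checked against those worksheets before they are placed in a boot image or postscript.

  #!/bin/sh
  # Usage: hfi_calc <LPARs on the HFI fabric> <GPFS servers> <socket buffer in bytes>
  LPARS=$1
  SERVERS=$2
  SOCKBUF=$3        # e.g. 4198400 (4 MiB + 4 KiB), from the GPFS tunables section

  # arptab_bsiz = (8 * LPARs) / 149, with a minimum of 9
  BSIZ=$(( (8 * LPARS) / 149 ))
  [ $BSIZ -lt 9 ] && BSIZ=9
  echo "no -r -o arptab_nb=149 -o arptab_bsiz=$BSIZ"

  # GPFS client rpoolsize/spoolsize candidate: servers * socket buffer, rounded to MiB.
  # Compare the result against the 64 MiB default minimum discussed above.
  POOL=$(( SERVERS * SOCKBUF ))
  POOL_MIB=$(( (POOL + 524288) / 1048576 ))
  printf "client rpoolsize/spoolsize candidate: %d bytes (about %d MiB)\n" $POOL $POOL_MIB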

The max socket buffer is the same in all boot images (that includes service nodes, along with compute, login, and GPFS storage nodes), and it must be set in the boot image so that the ml0 driver configures with it. It should be 4 times the GPFS/TCP socket buffer size. If tcp_sendspace and tcp_recvspace are to be set to 4198400, the value of sb_max should be 16793600:
no -r -o sb_max=16793600
The following should be set prior to GPFS startup. There is no reason not to set them in the boot image, and they must be the same in all boot images (note that this includes the images for service, compute, login, and GPFS storage nodes):
no -o rfc1323=1
no -o tcp_recvspace=4198400
no -o tcp_sendspace=4198400
no -o sack=0
TCP send space and TCP receive space are the default socket buffer sizes for TCP connections. Note that the values used for the GPFS socket buffer sizes are defined in the GPFS tuning section.

Special note for diskful nodes only (e.g., service nodes)
For any command in this document which changes the boot image (e.g., vmo), the bosboot command is also required to effect the modification of the boot image. For example:
bosboot -a -d /dev/ipldevice

GPFS tunables
1) mmchconfig socketrcvbuffersize=4198400 (where run?)
2) mmchconfig socketsndbuffersize=4198400
The value of 4198400 is chosen to be one quarter of a 16 MiB GPFS data block, plus the NSD network checksum (4 MiB plus 4 KiB = 4194304 + 4096 = 4198400). This allows a 16 MiB GPFS data block to be sent or received using 4 buffers, an 8 MiB GPFS data block to be transferred using 2 buffers, and a GPFS data block of 4 MiB or less to be transferred using 1 buffer. These values should be no larger than sb_max/4.
3) In order to reduce the overhead of metadata operations and provide more responsive filesystems, the customer may decide that they do not need exact mtime values to be tracked by GPFS. In this case the modification time on a file will be changed when it is closed, but it will not be changed each time the file is modified. For each GPFS filesystem for which this behavior is desired, run:
mmchfs device_name_of_gpfs_filesystem -E no

4) For login nodes and NFS export nodes:
mmchconfig maxfilestocache=10000,maxstatcache=50000 -N node1,node2,...,nodeN
If compute nodes operate on large numbers of files, consider making the above change, or a similar one, on them as well.
5) Large-system customers may want to consider increasing the GPFS lease to avoid nodes dropping off a filesystem due to a lease timeout, for example:
mmchconfig leaserecoverywait=120
mmchconfig maxmissedpingtimeout=<value>

GPFS Server ODM changes for HBAs and hdisks
Note: The 0x400000 (4 MiB) value below assumes a 16 MiB GPFS blocksize. This is almost certainly the case in any P7 GPFS installation.
Inside the GPFS server boot image, the following odmdelete/odmadd commands must be run to add the max_coalesce attribute to the hdisks:
# odmdelete -o PdAt -q 'uniquetype=disk/sas/nonmpioscsd and attribute=max_coalesce'
The delete is just to make sure a second max_coalesce attribute is not added; it is okay if it results in "0 objects deleted."
Create a file 'max_coalesce.add' with the following contents (the first and last lines of the file must be blank):
PdAt:
        uniquetype = "disk/sas/nonmpioscsd"
        attribute = "max_coalesce"
        deflt = "0x400000"
        values = "0x10000,0x20000,0x40000,0x80000,0x100000,0x200000,0x400000,0x800000,0xfff000,0x"
        width = ""
        type = "R"
        generic = "DU"
        rep = "nl"
        nls_index = 137
Then run the following odmadd inside the server boot image:
# odmadd < max_coalesce.add
The newly created max_coalesce attribute will have the required default value of 4 MiB (0x400000).
The default for the max_transfer hdisk attribute also needs to be changed, to 0x400000, using the /usr/sbin/chdef command, which must be run inside the xcatchroot for the server boot image:

# /usr/sbin/chdef -a max_transfer=0x400000 -c disk -s sas -t nonmpioscsd
Optionally, the HBA attributes max_commands and max_transfer can have their default values changed. Both the 2-port HBA (uniquetype ea0) and the 4-port HBA (uniquetype f60) require a max_transfer of 0x400000. This can be set with the chdef command in the server boot image chroot environment:
# chdef -a max_transfer=0x400000 -c adapter -s pciex -t ea0
# chdef -a max_transfer=0x400000 -c adapter -s pciex -t f60
The 2-port HBA takes a max_commands of 248 and the 4-port HBA takes a max_commands of 124. Again, these must be set in the change-root environment of the GPFS server boot image:
# chdef -a max_commands=248 -c adapter -s pciex -t ea0
# chdef -a max_commands=124 -c adapter -s pciex -t f60

Passwordless root ssh/scp
Every LPAR (server, compute, or other) that is to be part of the P7 GPFS cluster must be able to passwordlessly ssh/scp as root to all the others so that GPFS commands and configuration work. This can be configured either in the boot image or through an xcat postscript.

Installation checklists
This topic provides a table of tasks that must be completed.
The following is a table of tasks that must be completed in order to achieve a successful installation. Some tasks may be optional, but the decision not to perform a task should be well known and accounted for. There may be multiple phases in an installation. Each phase should have its own checklist, along with a simple master checklist that indicates when all phases are complete. In the Date Completed column, use NA to indicate that a task is not required.
Table 10. Installation checklist
Installation Checklist (Phase: )
Task | Date Completed
Planning tasks and worksheets completed. Use the Cluster Planning Checklist to determine when this task is completed.

124 Installation Checklist (Phase: Task ) Date Completed Order Reception All software licenses have been obtained All software media received undamaged (as applicable) All hardware received undamaged Hardware Placing hardware. EMS; Frames; Servers; Disk Enclosures; Communication network cabling; Service LAN cabling; Management LAN Cabling; Public LAN devices and cabling, and so on Management Subsystems Executive Management Server installed (Hardware placed; Base Operating System and updates, including efixes; optional software; tuning for scaling; network connections) xcat configuration complete (file system for install resources; Configure and start DNS, DHCP and conserver; Install and configure DB2; create LoadLeveler userid; Install TEAL; HPC Hardware server; ISNM; LoadLeveler Central Manager) xcat started on EMS Network Management Serviceable Event logging HMC(s) configured in the site table on the EMS Network Management configuration complete Network Management started on the EMS TEAL configuration complete TEAL started on the EMS Copy Full HPC packages xcat client/server certificates and ssh keys configured Create userids for EMS with appropriate authority (admins, operators, IBM service) LANs are isolated (no routing of packets between public network and service and management networks) Firewall configuration Cluster configuration file set up Verify service network configuration and connectivity Verify DHCP configuration on EMS HMCs installed and powered on without DHCP service on HMCs Frames powered on (note the order) Verify all HMCs, FSPs, and BPAs are visable to EMS (MTMS s collected, FSPs are associated with appropriate BPAs) HMC Firmware updated and compatible with Platform and Power firmware Platform firmware updated and compatible with HMC and Power firmware Power firmware updated and compatible with HMC and Platform firmware NTP service configured Service Utility Nodes operating system diskful images available p. 124 of 135

125 Installation Checklist (Phase: ) Task Login and Service Node sync files configured Network install configured for Service and Login nodes Service nodes booted and successfully configured Login nodes booted and successfully configured Storage nodes booted and successfully configured Compute nodes booted and successfully configured All other node types booted and successfully configured On phased install successful xcat Merge of a Just-Deployed Set of Nodes Into an Existing Cluster Central Job Management LoadLeveler Central Manager installed (note where) Configure ODBC Operating Systems Operating System image definitions are created in Cluster Database Operating System diskless image(s) created for compute servers Operating System diskless image(s) created for storage servers Operating System diskless image(s) created for Log-in nodes Operating System disklessimage(s) created for other node types Operating System tunables applied to compute nodes Operating System tunables applied to storage nodes GPFS and Storage Subsystem Service nodes are configured to manage appropriate Storage Nodes Network between service nodes and storages nodes is established Network boot configured between service nodes and compute nodes Storage nodes installed and successfully booted and configured Disk Enclosures installed and successfully configured Configure and start GPFS daemon on storage nodes Compute Nodes Service nodes are configured to manage appropriate Compute Nodes Network between service nodes and compute nodes is established Network boot configured between service nodes and compute nodes Compute nodes installed and successfully booted and configured Job Management LoadLeveler Central Manager installed (note where) Installation Verification All devices power on successfully All devices power off and back on successfully All devices boot successfully All devices go into standby and reboot successfully (if applicable) No outstanding serviceable events being reported No communication network miswires All required communication network links are available p. 125 of 135 Date Completed

126 Installation Checklist (Phase: ) Task All tunables and non-default application and operating system settings are verified to be configured on all systems. netvsp runs successfully Site-specific test applications run successfully Performance metrics expectations are met Performance Tuning Operating System tunables applied Parallel Environment s environment variable profiles created Date Completed Service Local support has service contact numbers and problem procedures IBM product engineering has been contacted to initiate Power 775 Availability Plus monitoring regimen Hardware warranty phase activated Software warranty phase activated Installation Responsibilities by Component While you may be able to determine this by reviewing procedures, it is sometimes desirable to understand the responsibility for installation and bring-up relative to particular components. With that in mind, this section is intended to make it easier to determine who is responsible for installing or bringing up a particular component. This section is divided into software components and hardware components. The information is intended to represent the typical case. Other arrangements may be made during the planning phase. Anything deviating from the typical should be document Installation Responsibilities by Software Component The following table provides a list of software components that cross-references to the team responsible for their installation and bring-up. For details on teams, see Installation Responsibilities, on page 77. The information is intended to represent the typical case. Other arrangements may be made during the planning phase. Anything deviating from the typical should be document. Component Team EMS software Software installation team EMS operating system image Software installation team xcat Software installation team Hardware Server (HWS) Software installation team DB2 Software installation team Cluster Database on EMS Software installation team p. 126 of 135

127 Component Team ISNM and CNM Software installation team TEAL Software installation team LoadLeveler Software installation team HPC stack Software installation team System firmware Software installation team Frame/power firmware Software installation team HMC software/firmware Software installation team SAS Expansion Card firmware for the IBM disk enclosure feature 5854 for the 9125-F2C system Software installation team Disk firmware for the IBM disk enclosure feature 5854 Software installation team for the 9125-F2C system SAS card firmware Software installation team Other adapter firmware (not all adapters require firmware) Software installation team Service node definitions and operating system images Software installation team Diskless node definitions and operating system images Software installation team (compute, IO, utility, login, and so on) GPFS Filesystem installation team Filesystem Filesystem installation team Installation Responsibilities by Hardware Component The following table provides a list of hardware components that cross-references to the team responsible for their installation and bring-up. For details on teams, see Installation Responsibilities, on page 77. The information is intended to represent the typical case. Other arrangements may be made during the planning phase. Anything deviating from the typical should be document. Component Team Frame placement Movers contracted by IBM Frame leveling Hardware installation team HMC hardware Hardware installation team EMS hardware Hardware installation team Management console monitors and keyboards Hardware installation team p. 127 of 135

128 Component Team 9125-F2C systems Pre-installed in frames in manufacturing 9125-F2C disk enclosure features Pre-installed in frames in manufacturing Bulk power supplies Pre-installed in frames in manufacturing Water Cooling Units Pre-installed in frames in manufacturing HFI Cabling HFI cabling installation team Service and Management VLAN ethernet cables Hardware installation team Public LAN ethernet cables Customer installation team Public LAN ethernet devices Customer installation team p. 128 of 135

129 3 Notices This information was developed for products and services offered in the U.S.A. The manufacturer may not offer the products, services, or features discussed in this document in other countries. Consult the manufacturer's representative for information on the products and services currently available in your area. Any reference to the manufacturer's product, program, or service is not intended to state or imply that only that product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any intellectual property right of the manufacturer may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any product, program, or service. The manufacturer may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to the manufacturer. The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: THIS INFORMATION IS PROVIDED AS IS WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. The manufacturer may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to Web sites not owned by the manufacturer are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this product and use of those Web sites is at your own risk. The manufacturer may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment. Information concerning products not produced by this manufacturer was obtained from the suppliers of those products, their published announcements or other publicly available sources. This manufacturer has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to products not produced by this manufacturer. Questions on the capabilities of products not produced by this manufacturer should be addressed to the suppliers of those products. All statements regarding the manufacturer's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only. p. 129 of 135

130 The manufacturer's prices shown are the manufacturer's suggested retail prices, are current and are subject to change without notice. Dealer prices may vary. This information is for planning purposes only. The information herein is subject to change before the products described become available. This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental. If you are viewing this information in softcopy, the photographs and color illustrations may not appear. The drawings and specifications contained herein shall not be reproduced in whole or in part without the written permission of the manufacturer. The manufacturer has prepared this information for use with the specific machines indicated. The manufacturer makes no representations that it is suitable for any other purpose. The manufacturer's computer systems contain mechanisms designed to reduce the possibility of undetected data corruption or loss. This risk, however, cannot be eliminated. Users who experience unplanned outages, system failures, power fluctuations or outages, or component failures must verify the accuracy of operations performed and data saved or transmitted by the system at or near the time of the outage or failure. In addition, users must establish procedures to ensure that there is independent data verification before relying on such data in sensitive or critical operations. Users should periodically check the manufacturer's support websites for updated information and fixes applicable to the system and related software. Ethernet connection usage restriction This product is not intended to be connected directly or indirectly by any means whatsoever to interfaces of public telecommunications networks. Trademarks IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at Copyright and trademark information at Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Red Hat, the Red Hat "Shadow Man" logo, and all Red Hat-based trademarks and logos are trademarks or registered trademarks of Red Hat, Inc., in the United States and other countries. UNIX is a registered trademark of The Open Group in the United States and other countries. p. 130 of 135

Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.

Electronic emission notices

Class A Notices

The following Class A statements apply to the IBM servers that contain the POWER7 processor and its features unless designated as electromagnetic compatibility (EMC) Class B in the feature information.

Federal Communications Commission (FCC) statement

Note: This equipment has been tested and found to comply with the limits for a Class A digital device, pursuant to Part 15 of the FCC Rules. These limits are designed to provide reasonable protection against harmful interference when the equipment is operated in a commercial environment. This equipment generates, uses, and can radiate radio frequency energy and, if not installed and used in accordance with the instruction manual, may cause harmful interference to radio communications. Operation of this equipment in a residential area is likely to cause harmful interference, in which case the user will be required to correct the interference at his own expense.

Properly shielded and grounded cables and connectors must be used in order to meet FCC emission limits. IBM is not responsible for any radio or television interference caused by using other than recommended cables and connectors or by unauthorized changes or modifications to this equipment. Unauthorized changes or modifications could void the user's authority to operate the equipment.

This device complies with Part 15 of the FCC rules. Operation is subject to the following two conditions: (1) this device may not cause harmful interference, and (2) this device must accept any interference received, including interference that may cause undesired operation.

Industry Canada Compliance Statement

This Class A digital apparatus complies with Canadian ICES-003.

Avis de conformité à la réglementation d'Industrie Canada

Cet appareil numérique de la classe A est conforme à la norme NMB-003 du Canada.

European Community Compliance Statement

This product is in conformity with the protection requirements of EU Council Directive 2004/108/EC on the approximation of the laws of the Member States relating to electromagnetic compatibility. IBM cannot accept responsibility for any failure to satisfy the protection requirements resulting from a non-recommended modification of the product, including the fitting of non-IBM option cards.

This product has been tested and found to comply with the limits for Class A Information Technology Equipment according to European Standard EN 55022. The limits for Class A equipment were derived for commercial and industrial environments to provide reasonable protection against interference with licensed communication equipment.

European Community contact:
IBM Deutschland GmbH
Technical Regulations, Department M456
IBM-Allee 1, Ehningen, Germany
Tele:

Warning: This is a Class A product. In a domestic environment, this product may cause radio interference, in which case the user may be required to take adequate measures.

VCCI Statement - Japan

The following is a summary of the VCCI Japanese statement in the box above: This is a Class A product based on the standard of the VCCI Council. If this equipment is used in a domestic environment, radio interference may occur, in which case the user may be required to take corrective actions.

Japanese Electronics and Information Technology Industries Association (JEITA) Confirmed Harmonics Guideline (products less than or equal to 20 A per phase)

Japanese Electronics and Information Technology Industries Association (JEITA) Confirmed Harmonics Guideline with Modifications (products greater than 20 A per phase)

Electromagnetic Interference (EMI) Statement - People's Republic of China

Declaration: This is a Class A product. In a domestic environment this product may cause radio interference, in which case the user may need to perform practical action.

Electromagnetic Interference (EMI) Statement - Taiwan

The following is a summary of the EMI Taiwan statement above.

Warning: This is a Class A product. In a domestic environment this product may cause radio interference, in which case the user will be required to take adequate measures.

IBM Taiwan Contact Information:

Electromagnetic Interference (EMI) Statement - Korea

Please note that this equipment has obtained EMC registration for commercial use. In the event that it has been mistakenly sold or purchased, please exchange it for equipment certified for home use.

Germany Compliance Statement

Deutschsprachiger EU Hinweis: Hinweis für Geräte der Klasse A EU-Richtlinie zur Elektromagnetischen Verträglichkeit

Dieses Produkt entspricht den Schutzanforderungen der EU-Richtlinie 2004/108/EG zur Angleichung der Rechtsvorschriften über die elektromagnetische Verträglichkeit in den EU-Mitgliedsstaaten und hält die Grenzwerte der EN 55022 Klasse A ein. Um dieses sicherzustellen, sind die Geräte wie in den Handbüchern beschrieben zu installieren und zu betreiben. Des Weiteren dürfen auch nur von der IBM empfohlene Kabel angeschlossen werden. IBM übernimmt keine Verantwortung für die Einhaltung der Schutzanforderungen, wenn das Produkt ohne Zustimmung von IBM verändert bzw. wenn Erweiterungskomponenten von Fremdherstellern ohne Empfehlung von IBM gesteckt/eingebaut werden.

EN 55022 Klasse A Geräte müssen mit folgendem Warnhinweis versehen werden: "Warnung: Dieses ist eine Einrichtung der Klasse A. Diese Einrichtung kann im Wohnbereich Funk-Störungen verursachen; in diesem Fall kann vom Betreiber verlangt werden, angemessene Maßnahmen zu ergreifen und dafür aufzukommen."

Deutschland: Einhaltung des Gesetzes über die elektromagnetische Verträglichkeit von Geräten

Dieses Produkt entspricht dem Gesetz über die elektromagnetische Verträglichkeit von Geräten (EMVG). Dies ist die Umsetzung der EU-Richtlinie 2004/108/EG in der Bundesrepublik Deutschland.

Zulassungsbescheinigung laut dem Deutschen Gesetz über die elektromagnetische Verträglichkeit von Geräten (EMVG) (bzw. der EMC EG Richtlinie 2004/108/EG) für Geräte der Klasse A

Dieses Gerät ist berechtigt, in Übereinstimmung mit dem Deutschen EMVG das EG-Konformitätszeichen - CE - zu führen.

Verantwortlich für die Einhaltung der EMV Vorschriften ist der Hersteller:
International Business Machines Corp.
New Orchard Road
Armonk, New York
Tel:

Der verantwortliche Ansprechpartner des Herstellers in der EU ist:
IBM Deutschland GmbH
Technical Regulations, Abteilung M456
IBM-Allee 1, Ehningen, Germany
Tel: tjahn@de.ibm.com

Generelle Informationen: Das Gerät erfüllt die Schutzanforderungen nach EN 55024 und EN 55022 Klasse A.

Electromagnetic Interference (EMI) Statement - Russia

Terms and conditions

Permission for the use of these publications is granted subject to the following terms and conditions.

Personal Use: You may reproduce these publications for your personal, noncommercial use provided that all proprietary notices are preserved.
