High performance clustering using the 9125-F2C Service Guide


Power Systems
High performance clustering using the 9125-F2C Service Guide
Revision 1.2
p. 1 of 119



Note

Before using this information and the product it supports, read the information in the Safety Notices section, in the IBM Systems Safety Notices manual, G , and in the IBM Environmental Notices and User Guide, Z .

This edition applies to IBM Power Systems 9125-F2C servers that contain the POWER7 processor.

Copyright IBM Corporation 2011. US Government Users Restricted Rights: Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Table of Contents

1 Safety notices
2 Service: High performance clustering using the 9125-F2C
  2.1 Using the Cluster Guides
    Cluster Guide Revision History
  2.2 Clustering systems by using 9125-F2C
    Cluster information resources
      General cluster information resources
      Cluster hardware information resources
      Cluster management software information resources
      Cluster software and firmware information resources
  2.3 Cluster service
    Start Here
    User reported problems
    System log reported problem
    Problem reported in TEAL
    Problem reported in Service Focal Point
    Problem reported in a component or subsystem log
    Hardware problems
    Software problems
      Node will not boot
      Application crashing diagnosis
      Performance Problem Diagnosis
      HFI Network Problems
      ISNM command results indicating problems
      Checking routing mode
      xCAT problems
        General xCAT problems
        Operating system deployment problems and xCAT
    EMS Failover Actions
    TEAL Service Procedures
      Problem with HMC to TEAL reporting path
    HFI Network Alerts and Serviceable Events
      HFI Network Alerts and Serviceable Events Overview
      Acting on an HFI Network Alert
        HFI Network Alert Format
      Acting on an HFI Network Serviceable Event
        HFI Network Serviceable Event Format
        HFI Network FRU Locations
      HFI Serviceable Event Procedures
        HFI Isolation Procedures
        HFI Symbolic Procedures
    HFI Network Locations
      HFI Network Logical Hardware Locations
      HFI Network Service/Physical Hardware Locations
      HFI Network Application Locations
      Mapping between logical hardware and service/physical locations
    Power 775 Availability Plus actions
      Availability Plus recovery procedure
      Availability Plus: Recovering a Compute Node
      Availability Plus: Recovering a non-compute node
      Availability Plus: Recovering a hub module
      Availability Plus: Recovering a D-link
      Availability Plus: Recovering an LR-link
      Availability Plus: Recovering a failure that impacts a drawer
      Availability Plus: Recovering a PCIe Slot
      Gathering data for Availability Plus resource failures
      Availability Plus: Restoring repaired non-compute nodes
      Availability Plus: Restoring repaired Compute nodes
    Hardware Service Locations for 9125-F2C
    Data Collection
    HFI Network Event Reference
    Generic HFI Network FRU lists
Notices

1 Safety notices

Safety notices may be printed throughout this guide:
- DANGER notices call attention to a situation that is potentially lethal or extremely hazardous to people.
- CAUTION notices call attention to a situation that is potentially hazardous to people because of some existing condition.
- Attention notices call attention to the possibility of damage to a program, device, system, or data.

World Trade safety information

Several countries require the safety information contained in product publications to be presented in their national languages. If this requirement applies to your country, a safety information booklet is included in the publications package shipped with the product. The booklet contains the safety information in your national language with references to the U.S. English source. Before using a U.S. English publication to install, operate, or service this product, you must first become familiar with the related safety information in the booklet. You should also refer to the booklet any time you do not clearly understand any safety information in the U.S. English publications.

German safety information

Das Produkt ist nicht für den Einsatz an Bildschirmarbeitsplätzen im Sinne § 2 der Bildschirmarbeitsverordnung geeignet. (The product is not suitable for use at visual display workplaces according to § 2 of the German Ordinance for Work with Visual Display Units.)

Laser safety information

IBM servers can use I/O cards or features that are fiber-optic based and that utilize lasers or LEDs.

Laser compliance

IBM servers may be installed inside or outside of an IT equipment rack.

DANGER

When working on or around the system, observe the following precautions:

Electrical voltage and current from power, telephone, and communication cables are hazardous. To avoid a shock hazard:
- Connect power to this unit only with the IBM provided power cord. Do not use the IBM provided power cord for any other product.
- Do not open or service any power supply assembly.
- Do not connect or disconnect any cables or perform installation, maintenance, or reconfiguration of this product during an electrical storm.
- The product might be equipped with multiple power cords. To remove all hazardous voltages, disconnect all power cords.
- Connect all power cords to a properly wired and grounded electrical outlet. Ensure that the outlet supplies proper voltage and phase rotation according to the system rating plate.
- Connect any equipment that will be attached to this product to properly wired outlets.
- When possible, use one hand only to connect or disconnect signal cables.
- Never turn on any equipment when there is evidence of fire, water, or structural damage.
- Disconnect the attached power cords, telecommunications systems, networks, and modems before you open the device covers, unless instructed otherwise in the installation and configuration procedures.
- Connect and disconnect cables as described in the following procedures when installing, moving, or opening covers on this product or attached devices.

To Disconnect:
1. Turn off everything (unless instructed otherwise).
2. Remove the power cords from the outlets.
3. Remove the signal cables from the connectors.
4. Remove all cables from the devices.

To Connect:
1. Turn off everything (unless instructed otherwise).
2. Attach all cables to the devices.
3. Attach the signal cables to the connectors.
4. Attach the power cords to the outlets.
5. Turn on the devices.

(D005)

DANGER

Observe the following precautions when working on or around your IT rack system:
- Heavy equipment: personal injury or equipment damage might result if mishandled.
- Always lower the leveling pads on the rack cabinet.
- Always install stabilizer brackets on the rack cabinet.
- To avoid hazardous conditions due to uneven mechanical loading, always install the heaviest devices in the bottom of the rack cabinet. Always install servers and optional devices starting from the bottom of the rack cabinet.
- Rack-mounted devices are not to be used as shelves or work spaces. Do not place objects on top of rack-mounted devices.
- Each rack cabinet might have more than one power cord. Be sure to disconnect all power cords in the rack cabinet when directed to disconnect power during servicing.
- Connect all devices installed in a rack cabinet to power devices installed in the same rack cabinet. Do not plug a power cord from a device installed in one rack cabinet into a power device installed in a different rack cabinet.
- An electrical outlet that is not correctly wired could place hazardous voltage on the metal parts of the system or the devices that attach to the system. It is the responsibility of the customer to ensure that the outlet is correctly wired and grounded to prevent an electrical shock.

CAUTION
- Do not install a unit in a rack where the internal rack ambient temperatures will exceed the manufacturer's recommended ambient temperature for all your rack-mounted devices.
- Do not install a unit in a rack where the air flow is compromised. Ensure that air flow is not blocked or reduced on any side, front, or back of a unit used for air flow through the unit.
- Consideration should be given to the connection of the equipment to the supply circuit so that overloading of the circuits does not compromise the supply wiring or overcurrent protection. To provide the correct power connection to a rack, refer to the rating labels located on the equipment in the rack to determine the total power requirement of the supply circuit.
- (For sliding drawers.) Do not pull out or install any drawer or feature if the rack stabilizer brackets are not attached to the rack. Do not pull out more than one drawer at a time. The rack might become unstable if you pull out more than one drawer at a time.
- (For fixed drawers.) This drawer is a fixed drawer and must not be moved for servicing unless specified by the manufacturer. Attempting to move the drawer partially or completely out of the rack might cause the rack to become unstable or cause the drawer to fall out of the rack.

(R001)

CAUTION

Removing components from the upper positions in the rack cabinet improves rack stability during relocation. Follow these general guidelines whenever you relocate a populated rack cabinet within a room or building:
- Reduce the weight of the rack cabinet by removing equipment starting at the top of the rack cabinet. When possible, restore the rack cabinet to the configuration of the rack cabinet as you received it. If this configuration is not known, you must observe the following precautions:
  - Remove all devices in the 32U position and above.
  - Ensure that the heaviest devices are installed in the bottom of the rack cabinet.
  - Ensure that there are no empty U-levels between devices installed in the rack cabinet below the 32U level.
- If the rack cabinet you are relocating is part of a suite of rack cabinets, detach the rack cabinet from the suite.
- Inspect the route that you plan to take to eliminate potential hazards.
- Verify that the route that you choose can support the weight of the loaded rack cabinet. Refer to the documentation that comes with your rack cabinet for the weight of a loaded rack cabinet.
- Verify that all door openings are at least 760 x 2030 mm (30 x 80 in.).
- Ensure that all devices, shelves, drawers, doors, and cables are secure.
- Ensure that the four leveling pads are raised to their highest position.
- Ensure that there is no stabilizer bracket installed on the rack cabinet during movement.
- Do not use a ramp inclined at more than 10 degrees.
- When the rack cabinet is in the new location, complete the following steps:
  - Lower the four leveling pads.
  - Install stabilizer brackets on the rack cabinet.
  - If you removed any devices from the rack cabinet, repopulate the rack cabinet from the lowest position to the highest position.
- If a long-distance relocation is required, restore the rack cabinet to the configuration of the rack cabinet as you received it. Pack the rack cabinet in the original packaging material, or equivalent. Also lower the leveling pads to raise the casters off of the pallet and bolt the rack cabinet to the pallet.

(R002)

(L001) (L002)

(L003)

All lasers are certified in the U.S. to conform to the requirements of DHHS 21 CFR Subchapter J for class 1 laser products. Outside the U.S., they are certified to be in compliance with IEC as a class 1 laser product. Consult the label on each part for laser certification numbers and approval information.

CAUTION: This product might contain one or more of the following devices: CD-ROM drive, DVD-ROM drive, DVD-RAM drive, or laser module, which are Class 1 laser products. Note the following information:
- Do not remove the covers. Removing the covers of the laser product could result in exposure to hazardous laser radiation. There are no serviceable parts inside the device.
- Use of the controls or adjustments or performance of procedures other than those specified herein might result in hazardous radiation exposure.
(C026)

Data processing environments can contain equipment transmitting on system links with laser modules that operate at greater than Class 1 power levels. For this reason, never look into the end of an optical fiber cable or open receptacle. (C027)

CAUTION: This product contains a Class 1M laser. Do not view directly with optical instruments. (C028)

CAUTION: Some laser products contain an embedded Class 3A or Class 3B laser diode. Note the following information: laser radiation when open. Do not stare into the beam, do not view directly with optical instruments, and avoid direct exposure to the beam. (C030)

Power and cabling information for NEBS (Network Equipment-Building System) GR-1089-CORE

The following comments apply to the IBM servers that have been designated as conforming to NEBS (Network Equipment-Building System) GR-1089-CORE:
- The equipment is suitable for installation in the following:
  - Network telecommunications facilities
  - Locations where the NEC (National Electrical Code) applies
- The intrabuilding ports of this equipment are suitable for connection to intrabuilding or unexposed wiring or cabling only. The intrabuilding ports of this equipment must not be metallically connected to the interfaces that connect to the OSP (outside plant) or its wiring. These interfaces are designed for use as intrabuilding interfaces only (Type 2 or Type 4 ports as described in GR-1089-CORE) and require isolation from the exposed OSP cabling. The addition of primary protectors is not sufficient protection to connect these interfaces metallically to OSP wiring. Note: All Ethernet cables must be shielded and grounded at both ends.
- The ac-powered system does not require the use of an external surge protection device (SPD).
- The dc-powered system employs an isolated DC return (DC-I) design. The DC battery return terminal shall not be connected to the chassis or frame ground.

2 Service: High performance clustering using the 9125-F2C

You can use this information to guide you through the process of servicing 9125-F2C clusters. It is part of a series of guides to high performance clustering using the 9125-F2C:
- High performance clustering using the 9125-F2C Planning and Installation Guide
- High performance clustering using the 9125-F2C Management Guide
- High performance clustering using the 9125-F2C Service Guide

This document is intended to serve as a consolidation point for important information for preparing for service, servicing the cluster, and recovering cluster resources after performing service actions in an IBM High Performance Computing (HPC) cluster using POWER technology and the Host Fabric Interface with the 9125-F2C server. It serves as a consolidation point for the documents of the many components and subsystems that comprise a cluster, and it aids the reader in navigating those other documents in an efficient manner. Where necessary, it recommends additional information to a person who has general skills within the discipline of tasks being documented.

This document is not intended to replace existing guides for the various hardware units, firmware, operating system, software, or applications publications produced by IBM or other vendors. Therefore, most detailed procedures that already exist in other documents are not duplicated here. Instead, those other documents are referenced by this document, and, as necessary, guidance is given on how to work with generic information and procedures in the other documents.

The intended audiences for this document are:
- HPC clients, including:
  - System, network, and site planners
  - System administrators
  - Network administrators
  - System operators
  - Other Information Technology professionals
- IBM personnel:
  - Planners for the cluster and the site
  - System Service Representatives

2.1 Using the Cluster Guides

The document sections are roughly in the order in which you will need them. Reference the Table of Contents for an outline of the topics in this document. The following table is a guide to finding topics in the High performance clustering using the 9125-F2C Cluster Guides:
- High performance clustering using the 9125-F2C Planning and Installation Guide
- High performance clustering using the 9125-F2C Management Guide
- High performance clustering using the 9125-F2C Service Guide

Once directed to a document for a certain topic, it is good practice to become familiar with the table of contents as a detailed outline of sub-topics within the document.

Content highlights

Content | Description | Document
Clustering systems by using 9125-F2C | A chapter which provides overview information. It is customized within each guide. All of the guides have references to information resources, a brief overview of cluster components, and how to use the cluster guides. The planning and installation guide also has a detailed overview of the cluster, its subsystems, its components, and unique characteristics. | All Cluster Guides
Detailed Overview | In-depth overview of the cluster, its subsystems, its components, and unique characteristics. | Planning and Installation Guide
Planning information | Planning information and references for all major subsystems and components in the cluster. | Planning and Installation Guide
Supported devices and software | A list of supported devices and software at the time of publication. More up-to-date information is available on the HPC Central website (see References). | Planning and Installation Guide
Planning worksheets | Worksheets to help plan the cluster. | Planning and Installation Guide
Cluster Installation | This includes the following; references to other documentation are frequent: Installation Responsibilities; Overview of the Installation Flow; Installation steps organized by teams and objectives; Detailed installation procedures for topics not covered elsewhere. | Planning and Installation Guide
Cluster management | This includes the following; references to other documentation are frequent: Introduction to cluster management; A cluster management flow; HFI Network Management; Monitoring the cluster; Availability Plus monitoring; Data gathering for problem resolution; Cluster maintenance; Command references, especially for HFI network management. | Management Guide

Content | Description | Document
Cluster service | This includes the following; references to other documentation are frequent: Introduction to cluster service; Tables to narrow down to proper procedures; References to detailed procedures documented elsewhere; Hardware problem isolation topics; Software problem isolation topics; Power 775 Availability Plus actions; EMS failover references. | Service Guide

Cluster Guide Revision History

The following outlines changes to the Cluster Guide. This includes changes across all of the individual guides that comprise the cluster guide.

Table 1: Cluster Guide Revision History

Revision | Guide | Changes
1 | All | Initial release
1.1 | Planning and Installation Guide | In the Overview: Diskless nodes: new section; information on updating the Cluster Service and Management Networks; updated LoadLeveler configuration information; section on Highly Available Service Nodes and LoadLeveler and TEAL GPFS; Barrier-synchronization register overview; Availability Plus information on improving system availability. In Planning: Service and login nodes supported; LoadLeveler planning on Service Nodes; planning Highly Available Service Nodes and LoadLeveler and GPFS; TEAL monitoring and GPFS; typos and terminology cleanup. In Installation: typos and grammar; installation documentation links updated for TEAL and LoadLeveler; bringup of LoadLeveler in the installation flow; diskless node logging configuration in the installation flow; placement of LoadLeveler and TEAL GPFS daemons in the installation flow; Barrier Sync Register (BSR) configuration in the installation flow.

Revision | Guide | Changes
1.1 | Management Guide | Changes to command references: chnwfence; nwlinkdiag.
1.1 | Service Guide | Add Start Here section. Update User reported problems. Update System log reported problems. Update Problem reported in TEAL. Update Problem reported in Service Focal Point. Update Problem reported in component or subsystem log. Add Node will not boot. Add Checking routing mode. Add xCAT problems. Add TEAL Service Procedures. Add Acting on HFI Network Alert. Add Acting on HFI Network Serviceable Event. Isolation procedure updates and additions: HFI_DDG, HFI_LDG, HFINSFP, HFINNBR, HFILONG. Add Data collection section. Extensive HFI Network Locations updates. Add HFI Network Event Reference. Terminology updates.
1.2 | All | Add section for Cluster Guide Revision History.
1.2 | Planning and Installation Guide | Power 775 Availability Plus Overview updates: some numbered-list format issues for the Availability Plus overview; statement regarding when A+ Management begins.
1.2 | Management Guide | Command reference updates: Network Management commands: cnm.snap output info; OS system command reference for network management; chghfi changed to chdev; add ifhf_dump for AIX; add hfi_snap.
1.2 | Service Guide | Data collection updates for: ISNM network; HFI network or HFI driver specific to a node; TEAL.

2.2 Clustering systems by using 9125-F2C

Clustering systems by using 9125-F2C provides references to information resources and a brief overview of the components of the cluster. The cluster consists of many components and subsystems, each with an important task aimed at accomplishing user work and maintaining the ability to do so in the most efficient manner possible. The following paragraphs introduce and briefly describe various subsystems and their main components.

The compute subsystem consists of:
- The 9125-F2C systems configured as nodes dedicated to performing computational tasks. These are diskless nodes.
- Operating system images customized for compute nodes
- Applications

The storage subsystem consists of:
- 9125-F2C systems configured as IO nodes dedicated to serving the data for the other nodes in the cluster. These are diskless nodes.
- Operating system images customized for IO nodes
- SAS adapters in the 9125-F2C systems which are attached to Disk Enclosures
- Disk enclosures
- General Parallel File System (GPFS)

The communications subsystem consists of:
- The Host Fabric Interface technology in the 9125-F2C
- Busses from processor modules to the switching hub in an octant. For more information, see Octant on page 16.
- Local links (LL-links) between octants in a 9125-F2C. For more information, see Octant on page 16.
- Local remote links (LR-links) between drawers in a SuperNode
- Distance links (D-links) between SuperNodes
- The operating system drivers
- The IBM User space protocols
- AIX and Linux IP drivers

The management subsystem consists of:
- Executive Management Server (EMS) running key management software. Different types of servers might be used. For more details, see <<to be added>>.
- Operating system on the EMS
- Utility Nodes used as xCAT service nodes. These serve operating systems to local diskless nodes and provide hierarchical access to hardware and nodes from the EMS console.
- Extreme Cloud Administration Toolkit (xCAT) running on the EMS and service Utility Nodes. For information on xCAT, go to the xCAT website.
- DB2, running on the EMS and service Utility Nodes.

- Integrated Switch Network Manager (ISNM) running on the EMS
- Toolkit for Event Analysis and Logging (TEAL)
- Other Utility Nodes, customizable for each site. These utility nodes must be 9125-F2C servers:
  - Login Nodes are required.
  - Other site-unique nodes, such as tape subsystem servers. These unique nodes are optional, but must be 9125-F2C systems.

A summary of node types:
- Compute Nodes: provide computational capability. Compute nodes generally comprise most of the cluster.
- IO Nodes: provide connectivity to the storage subsystem. The number of IO nodes is driven by the amount of required storage.
- Utility Nodes: provide unique functions:
  - Service Nodes running xCAT, which serve operating systems to local diskless nodes and provide hierarchical access to hardware and nodes from the EMS console. These service nodes are required.
  - Login Nodes, which provide a log-in gateway into the cluster. These login nodes are required.
  - Other site-unique nodes, such as tape subsystem servers. These unique nodes are optional, but must be 9125-F2C systems.

Note: The EMS and HMC are considered to be management consoles and not nodes.

Key concepts that are introduced with this cluster are:
- Most of the nodes are diskless and get their operating systems and scratch space served by the service Utility Nodes. Diskless boot is performed over the HFI.
- The Power 775 Availability Plus configuration for processors, switching hubs, and HFI cables provides extra resources in the cluster that allow these components to fail without being replaced until the cluster is nearing the possibility of not being able to achieve agreed-upon workload capabilities.

Cluster information resources

Cluster information resources provides references to information resources for the cluster, its subsystems, and its components. The following tables indicate important documentation for the cluster, where to get it, and when to use it relative to the Planning, Installation, Management, and Service phases of a cluster's life. The tables are arranged into categories of components:
- General cluster information resources
- Cluster hardware information resources
- Cluster management software information resources
- Cluster software and firmware information resources

General cluster information resources

The following table lists general cluster information resources:

Table 2. General cluster resources

Component: IBM Cluster Information. Documents:
- IBM Clusters with HFI and 9125-F2C website
- HPC Central wiki and HPC Central forum. The HPC Central wiki enables collaboration between customers and IBM teams; it includes questions and comments.
- Power Systems High performance clustering using the 9125-F2C Planning and Installation Guide
- Power Systems High performance clustering using the 9125-F2C Management Guide
- Power Systems High performance clustering using the 9125-F2C Service Guide
- IBM HPC Clustering with Power 775 servers - Service Packs portion of the IBM High Performance Computing Clusters Service Packs website
- IBM Fix Central. Note: IBM Fix Central should only be used in conjunction with the readme website for IBM Clusters with HFI and 9125-F2C, above, because Fix Central may contain code levels that have been verified for other environments, but not for this cluster.

Cluster hardware information resources

The following table lists cluster hardware resources:

Table 3. Cluster hardware information resources

- Site planning for all IBM systems: System i and System p Site Preparation and Physical Planning Guides
- POWER7 9125-F2C system: Site and Hardware Planning Guide; Installation Guide for 9125-F2C; Servicing the IBM system p 9125-F2C; PCI Adapter Placement; Worldwide Customized Installation Instructions (WCII), the IBM service representative installation instructions for IBM machines and features
- Logical partitioning for all systems: Logical Partitioning Guide; Install Instructions for IBM LPAR on System i and System p

IBM Power Systems documentation is available in the IBM Power Systems Hardware Information Center. Any exceptions to the location of information resources for cluster hardware as stated above have been noted in the table. Any future changes to the location of information that occur before a new release of this document will be noted on the HPC Central website.

Cluster management software information resources

The following table lists cluster management software information resources:

Table 4. Cluster management software resources

- Hardware Management Console (HMC): Installation and Operations Guide for the HMC; Operations Guide for the HMC and Managed Systems
- xCAT: For xCAT documentation, go to the xCAT documentation website ( /index.php?title=xcat_documentation)
- Integrated Switch Network Manager (ISNM): This document

- Toolkit for Event Analysis and Logging (TEAL): On sourceforge.net: eal/index.php?title=main_page ; when installed, on the EMS: /opt/teal/doc/teal_guide.pdf

IBM Power Systems documentation is available in the IBM Power Systems Hardware Information Center.

Cluster software and firmware information resources

The following table lists cluster software and firmware information resources.

Table 5. Cluster software and firmware information resources

- AIX: AIX Information Center
- Linux: Obtain information from your Linux distribution source
- DB2: For information, go to DB2
- IBM HPC Clusters Software:
  - GPFS: Concepts, Planning, and Installation Guide
  - GPFS: Administration and Programming Reference
  - GPFS: Problem Determination Guide
  - GPFS: Data Management API Guide
  - Tivoli Workload Scheduler LoadLeveler for AIX: Installation Guide (SC )
  - Tivoli Workload Scheduler LoadLeveler for Linux: Installation Guide (SC )
  - Tivoli Workload Scheduler LoadLeveler: Using and Administering (SC )
  - Tivoli Workload Scheduler LoadLeveler: Command and API Reference (SC )
  - Tivoli Workload Scheduler LoadLeveler: Diagnosis and Messages Guide (SC )
  - Tivoli Workload Scheduler LoadLeveler: Resource Manager Guide (SC )
  - Parallel Environment: Installation
  - Parallel Environment: Messages
  - Parallel Environment: Operation and Use, Volumes 1 and 2
  - Parallel Environment: MPI Programming Guide
  - Parallel Environment: MPI Subroutine Reference

The IBM HPC Clusters Software Information can be found at the IBM Cluster Information Center.


2.3 Cluster service

This topic provides information about cluster service. It is broken down into multiple sub-topics ranging from problem isolation to recovery and repair procedures. There are general, cluster-wide topics, and topics that are specific to a subsystem or component. If you already have sufficient information and knowledge to narrow down to a specific topic or group of topics, the Table of Contents will be the most effective way to get to a topic. If you need more help in narrowing down to a topic, begin with Start Here.

Start Here

Because there are many service topics associated with a cluster, it is important to quickly narrow down to the appropriate topic. Using symptom analysis is the best way to accomplish this. Begin with the following table of symptoms and follow the links to further isolation techniques or service topics.

Symptom | Topic
User reported problem | User reported problems, on page 24
Problem reported in a system log (1) on a server (including service nodes and EMS) | System log reported problem, on page 24
Problem reported in TEAL | Problem reported in TEAL, on page 24
Problem reported in Service Focal Point | Problem reported in Service Focal Point, on page 29
Problem reported in a component or subsystem log (2) | Problem reported in a component or subsystem log, on page 30
Problem has been reported to Service Focal Point, but not to TEAL | Problem with HMC to TEAL reporting path, on page 37
Data collection is requested by IBM | Data Collection, on page 77

(1) A system log is dependent on the operating system. In AIX, it is accessed using errpt. In Linux, the system log is /var/log/messages.
(2) A component or subsystem log is specific to a component. This is not typically a log that should be monitored without support center or engineering direction.

User reported problems

User reported problems can range from specific return code issues to issues that are very generic in description. Use the following table to further narrow down and isolate the problem.

Symptom | Topic
Application crashing | Application crashing diagnosis, on page 32
Performance problem | Performance Problem Diagnosis, on page 33
Node will not boot | xCAT problems, on page 36. In particular, look for problems with nodes booting.
xCAT problem | xCAT problems, on page 36

System log reported problem

A system log is a common log used in an operating system image, or node. In AIX, the system log is accessed via errpt. In Linux, the system log is /var/log/messages. System logs can indicate hardware or software problems.

Most hardware problems are reported to TEAL and to the Service Focal Point instance on the HMC that owns the node. For the 9125-F2C, the exception to this rule is most often PCI-E problems. If the hardware problem is reported against the system planar, DCCA, or some other component not on the PCI-E bus, first query TEAL and Service Focal Point to determine whether a problem has been reported against the CEC drawer in which the node is populated. If the problem has not already been reported to IBM via Electronic Service Agent, or has not otherwise been brought to IBM hardware service attention, contact IBM hardware service with the TEAL, Service Focal Point, and system log information. If the problem involves a resource that is covered under Availability Plus, or if you are unsure whether it is covered, see Power 775 Availability Plus actions, on page 61. For problems reported to TEAL, see Problem reported in TEAL, on page 24. For problems reported to Service Focal Point, see Problem reported in Service Focal Point, on page 29.

If the problem is reported by a software component, contact IBM software service and provide the details of the software log.
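The AIX and Linux system-log locations above differ, so scripted checks need to dispatch on the operating system. The following is a minimal sketch; the `log_hint` helper is hypothetical (not part of any shipped cluster tooling) and only standard OS commands are assumed:

```shell
# Hypothetical helper: report how to read the system log for a given OS name.
# AIX keeps a binary error log read with errpt; Linux uses /var/log/messages.
log_hint() {
  case "$1" in
    AIX)   echo "errpt -a" ;;                # full-detail report; plain 'errpt' for a summary
    Linux) echo "tail /var/log/messages" ;;  # or grep the log for a specific resource name
    *)     echo "unknown OS: $1" >&2; return 1 ;;
  esac
}

# Dispatch on the running node's OS (non-fatal if the OS is neither):
log_hint "$(uname -s)" || true
```

On an AIX node this prints `errpt -a`; the administrator would then run that command and compare its entries against what TEAL and Service Focal Point have already reported.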
If the software problem is associated with the HFI driver, collect data as prescribed in Data Collection, on page 77.

Problem reported in TEAL

Problems can be reported to TEAL from several different subsystems. The subsystem can be identified in a TEAL alert field to begin isolating the problem and to reference the correct set of procedures. To determine the subsystem for a specific alert, do the following.

If you do not already know the rec_id, get it in one of the following ways:
- If you are using the brief report, the first field is the rec_id.

- If you are using -f text, the rec_id is on the line that begins with rec_id.
- If you are using -f csv, the rec_id is the first field.
- If you are using -f json, the rec_id is in the keyword:value combination "rec_id": [number].

Then run:
/opt/teal/bin/tllsalert -f text -q rec_id=[rec_id] | grep src_name

Look up the src_name in the following table. Note that the first row deals with alerts from multiple sources at around the same time. If this occurs, always use that procedure, which refers back to this table to resolve the individual alerts.

src_name: Alerts from multiple sources around the same time
  Procedure: Alerts reported by multiple sources around the same time, on page 29
src_name: LLEventAnalyzer (incident reported by LoadLeveler)
  Procedure: LoadLeveler Incidents reported to TEAL, on page 25
src_name: PNSDEventAnalyzer (problem reported by PNSD)
  Procedure: PNSD Problems reported to TEAL, on page 26
src_name: GPFSEventAnalyzer (problem reported by GPFS)
  Procedure: GPFS Incidents reported to TEAL, on page 27
src_name: CnmEventAnalyzer (HFI network problem reported by ISNM)
  Procedure: HFI Network problems reported by ISNM to TEAL, on page 28
src_name: SFPEventAnalyzer (problem reported by Service Focal Point)
  Procedure: Service Focal Point events reported to TEAL, on page 28
src_name: TEAL* (incident reported by TEAL itself)
  Procedure: Problems detected by TEAL, on page 28

LoadLeveler Incidents reported to TEAL

LoadLeveler incidents reported to TEAL are identified by src_name : LLEventAnalyzer.

First, consider the interaction of problems in the cluster at around the same time. In particular, look for TEAL alerts reported by other subsystems around the same time that may explain why LoadLeveler reported an incident.

Get the creation_time for the LoadLeveler incident:
/opt/teal/bin/tllsalert -f text -q rec_id=[rec_id of LL alert] | grep creation_time

The creation_time has the form [YYYY-MM-DD]-[hh:mm:ss]. You will be using ranges of times (a low-end and a high-end) to search. Note that the query requires a dash between the date and the time.
For example, if you are instructed to query 5 minutes before and 5 minutes after an alert, subtract 5 minutes from the alert's creation_time to get the low-end and add 5 minutes to get the high-end.

Using TEAL, check for a problem reported by Service Focal Point. Check for three minutes before and after the LoadLeveler alert.
/opt/teal/bin/tllsalert -f text -q src_name=SFPEventAnalyzer

creation_time>[low-end] creation_time<[high-end]

Check for a problem reported by ISNM to TEAL. Check for five minutes before and after the LoadLeveler event.
/opt/teal/bin/tllsalert -f text -q src_name=CnmEventAnalyzer creation_time>[low-end] creation_time<[high-end]

Check for a problem reported by GPFS to TEAL. Check for two minutes before and after the LoadLeveler event.
/opt/teal/bin/tllsalert -f text -q src_name=GPFSEventAnalyzer creation_time>[low-end] creation_time<[high-end]

Check for a problem reported by PNSD:
/opt/teal/bin/tllsalert -f text -q src_name=PNSDEventAnalyzer creation_time>[low-end] creation_time<[high-end]

If another component has reported a problem around the time of the LoadLeveler incident, and it appears to account for the LoadLeveler incident, close out the LoadLeveler alert:
/opt/teal/bin/tlchalert --id=[rec_id for the LL alert] --state=close

Then, go to Problem reported in TEAL, on page 24 and begin to resolve the problem as directed.

If no other component is reporting a problem around the time of the LoadLeveler incident, refer to the LoadLeveler Diagnosis and Messages guide, and Appendix A, Troubleshooting LoadLeveler, in LoadLeveler Using and Administering. Both documents are listed in Cluster software and firmware information resources, on page 21.

PNSD Problems reported to TEAL

PNSD problems reported to TEAL are identified by src_name : PNSDEventAnalyzer. The only problem currently reported by PNSD to TEAL has to do with an excessive number of dropped packets. Perform the following.

Check TEAL for HFI network problems reported around the same time as the PNSD problem:
/opt/teal/bin/tllsalert -f text -q src_name=CnmEventAnalyzer creation_time>[low-end] creation_time<[high-end]
where the low-end is about five minutes before the time of the PNSD event and the high-end is about five minutes after it. For example, if the PNSD event's creation_time ends in 10:36:30, the low-end would end in 10:31:30 and the high-end in 10:41:30.
Note that the query requires a dash between the date and the time.

If there was a problem in the HFI network, close out the PNSD alert, and continue to monitor for a reoccurrence:
/opt/teal/bin/tlchalert --id=[rec_id for the PNSD alert] --state=close
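The low-end/high-end window used throughout these queries can be computed rather than worked out by hand. A hedged sketch, assuming GNU date is available for relative-time arithmetic; the function name is illustrative:

```shell
# Hedged sketch: compute the [low-end, high-end] query bounds around a TEAL
# creation_time of the form YYYY-MM-DD-hh:mm:ss (dash between date and time).
# Requires GNU date; the 2011-07-21 value in the example below is made up.
teal_window() {  # $1 = creation_time, $2 = minutes either side
  ts=$1; mins=$2
  d=${ts%-*}                 # date portion (strip the trailing -hh:mm:ss)
  t=${ts##*-}                # time portion (text after the last dash)
  low=$(date -u -d "$d $t $mins minutes ago" +%Y-%m-%d-%H:%M:%S)
  high=$(date -u -d "$d $t $mins minutes" +%Y-%m-%d-%H:%M:%S)
  echo "$low $high"
}

# Example: teal_window 2011-07-21-10:36:30 5
# prints the two bounds to paste into:
#   creation_time>[low-end] creation_time<[high-end]
```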

If the PNSD alert reoccurs without another event around the same time, contact IBM software service and report the details of the alert.

If no HFI network events were discovered, check TEAL for hardware problems reported by Service Focal Point to TEAL at around the same time as the PNSD problem:
/opt/teal/bin/tllsalert -f text -q src_name=SFPEventAnalyzer creation_time>[low-end] creation_time<[high-end]
where the low-end is about five minutes before the time of the PNSD event and the high-end is about five minutes after it. Note that the query requires a dash between the date and the time.

If there was a problem with hardware, close out the PNSD alert, and continue to monitor for a reoccurrence:
/opt/teal/bin/tlchalert --id=[rec_id for the PNSD alert] --state=close
If the PNSD alert reoccurs without another event around the same time, contact IBM software service and report the details of the alert.

If there are no HFI network problems, nor hardware problems reported by Service Focal Point, contact IBM software service with the details of the alert. Once the alert has been resolved, close it out using:
/opt/teal/bin/tlchalert --id=[rec_id for the PNSD alert] --state=close

GPFS Incidents reported to TEAL

GPFS incidents reported to TEAL are identified by src_name : GPFSEventAnalyzer.

First, consider the interaction of problems in the cluster at around the same time. In particular, look for TEAL alerts reported by other subsystems around the same time that may explain why GPFS reported an incident.

Get the creation_time for the GPFS incident:
/opt/teal/bin/tllsalert -f text -q rec_id=[rec_id of GPFS alert] | grep creation_time

The creation_time has the form [YYYY-MM-DD]-[hh:mm:ss]. You will be using ranges of times to search. Note that the query requires a dash between the date and the time.
For example, if you are instructed to query 5 minutes before and 5 minutes after an alert, subtract 5 minutes from the alert's creation_time to get the low-end and add 5 minutes to get the high-end.

Using TEAL, check for a problem reported by Service Focal Point. Check for three minutes before and after the GPFS alert.
/opt/teal/bin/tllsalert -f text -q src_name=SFPEventAnalyzer creation_time>[low-end] creation_time<[high-end]

Check for a problem reported by ISNM to TEAL. Check for five minutes before and after the GPFS event.
/opt/teal/bin/tllsalert -f text -q src_name=CnmEventAnalyzer

creation_time>[low-end] creation_time<[high-end]

If no other component is reporting a problem around the time of the GPFS incident, refer to the GPFS error documentation to resolve the problem.

HFI Network problems reported by ISNM to TEAL

HFI network problems reported by ISNM to TEAL are identified by src_name : CnmEventAnalyzer.

HFI network problems reported by ISNM are classified as hardware or software problems. Software problems are identified by an alert_id that begins with BD00; all other alert_ids refer to hardware problems. If a software problem is reported by ISNM, contact IBM software service.

All HFI network problems reported against hardware are associated with Availability Plus resources. If the problem is related to hardware, review Power 775 Availability Plus actions, on page 61. If you are unsure whether the problem is associated with Availability Plus resources, refer to that section as well.

Do not close out serviceable events associated with Availability Plus resources unless instructed to do so by IBM hardware service. It is imperative that these events remain open until a repair is made, because they are critical in tracking the refresh threshold for Availability Plus.

For details on problem resolution for HFI network problems, see HFI Network Alerts and Serviceable Events, on page 38. Hardware problems are the responsibility of the IBM System Service Representative (SSR).

Once the problem has been resolved, close the alert using the following command. If the problem is associated with an Availability Plus resource and no repair is performed, the alert must remain open until a repair is made; this is critical in keeping track of the refresh threshold for Availability Plus.
/opt/teal/bin/tlchalert --id=[rec_id] --state=close

Service Focal Point events reported to TEAL

Service Focal Point events reported to TEAL are identified by src_name : SFPEventAnalyzer.
A serviceable event reported by Service Focal Point to TEAL should be reported to IBM hardware service. The problem should be worked by IBM service and closed in Service Focal Point. There is no need to close the problem in TEAL, because it is automatically closed when the serviceable event is closed in Service Focal Point. Treat the alert as if it were reported by Service Focal Point and follow the procedure in Problem reported in Service Focal Point, on page 29.

Problems detected by TEAL

Problems detected within the TEAL framework are identified by src_name : TEAL*.

For problems detected by TEAL, first determine whether the alert is informational only (such as TEAL starting). If the severity is I, the alert is informational. A typical example of an informational alert from TEAL is when TEAL starts

(alert_id=TL000001). Follow the recommendation of the alert, as found using:
/opt/teal/bin/tllsalert -f text -q rec_id=[rec_id]

Typically, the recommendation for an informational alert is None, in which case you should close out the alert using:
/opt/teal/bin/tlchalert --id=[rec_id] --state=close

Problems of severity E (Error) are more severe; contact IBM software service with the detailed information for the alert. You may obtain the detailed information using:
/opt/teal/bin/tllsalert -f text -q rec_id=[rec_id]

Alerts reported by multiple sources around the same time

If multiple alerts are reported around the same time and they come from multiple sources, there is an order in which you should attend to them, because an event in one subsystem may affect another subsystem and cause the affected subsystem to generate an alert. The order in which to check and resolve problems follows. As you address each alert, refer back to the table of reporting components and subsystems in Problem reported in TEAL, on page 24. Use about a five-minute window of time for relating problems.

Check for problems reported by Service Focal Point to TEAL. These have src_name=SFPEventAnalyzer.

Check for problems reported by ISNM to TEAL. Check for five minutes before and after the alert.
/opt/teal/bin/tllsalert -f text -q src_name=CnmEventAnalyzer creation_time>[low-end] creation_time<[high-end]

Check for a problem reported by GPFS to TEAL. Check for two minutes before and after the alert.
/opt/teal/bin/tllsalert -f text -q src_name=GPFSEventAnalyzer creation_time>[low-end] creation_time<[high-end]

Check for a problem reported by PNSD:
/opt/teal/bin/tllsalert -f text -q src_name=PNSDEventAnalyzer creation_time>[low-end] creation_time<[high-end]

Check for a problem reported by LoadLeveler:
/opt/teal/bin/tllsalert -f text -q src_name=LLEventAnalyzer creation_time>[low-end] creation_time<[high-end]

Problem reported in Service Focal Point

For problems reported by Service Focal Point, you may refer to the InfoCenter article on the Service Reference Code (SRC) or refcode associated with the problem.

For problems that are associated with Availability Plus resources, see Power 775 Availability Plus actions, on page 61. If you are unsure whether the problem is associated with Availability Plus resources, refer to that section as well. Do not close out serviceable events associated with Availability Plus resources unless instructed to do so by IBM hardware service. It is imperative that these events remain open until a repair is made, because they are critical in tracking the refresh threshold for Availability Plus.

If you prefer to work with TEAL, you may find the serviceable events in the TEAL alerts by searching for the problem number included in the raw_data of the alerts. Execute the following:

1. Run: /opt/teal/bin/tllsalert -f csv | grep "'Problem Number': <problem number from SFP>}"
2. Record the rec_id, which is the first field in what is returned.
3. To get the details from the TEAL alert, run: /opt/teal/bin/tllsalert -f text -q rec_id=<rec_id>

For example, if the problem number in the serviceable event were 5, the following would be done:

> /opt/teal/bin/tllsalert -f csv | grep "'Problem Number': 5}"
3116, , :00: ,E,N,U78AC S000,P,,,Power/Cooling subsystem & control (0x60) reported an error.,SFPEventAnalyzer,1,"{'FRU List': [['IQYRIEC', 'ACT04219I Isolate procedure', '', '', '', ''], ['45D9988', 'ACT04216I FRU', 'U78AC S000-P2-A2', '', '', ''], ['45D9084', 'ACT04216I FRU', 'U78AC S000-P2-C1', 'YH10SAN50019', '', '2C82']], 'SFP': 'c250hmc09.ppd.pok.ibm.com', 'Problem Number': 5}"
> # rec_id=3116
> /opt/teal/bin/tllsalert -f text -q rec_id=3116
===================================================
rec_id         : 3116
alert_id       :
creation_time  : :00:
severity       : E
urgency        : N
event_loc      : U78AC S000
event_loc_type : P
fru_loc        : None
recommendation :
reason         : Power/Cooling subsystem & control (0x60) reported an error.
src_name       : SFPEventAnalyzer
state          : 1
raw_data       : {'FRU List': [['IQYRIEC', 'ACT04219I Isolate procedure', '', '', '', ''], ['45D9988', 'ACT04216I FRU', 'U78AC S000-P2-A2', '', '', ''], ['45D9084', 'ACT04216I FRU', 'U78AC S000-P2-C1', 'YH10SAN50019', '', '2C82']], 'SFP': 'c250hmc09.ppd.pok.ibm.com', 'Problem Number': 5}

There is no need to close any alert associated with a serviceable event reported by Service Focal Point to TEAL.
When the problem is closed in Service Focal Point, it is automatically closed in TEAL.

Problem reported in a component or subsystem log

If the problem has been reported by the HFI device driver, or appears to be associated with the HFI network and there is no hardware event reported in TEAL or Service Focal Point, contact IBM software service, and collect data for HFI problems as directed in Data Collection, on page 77. IBM service may specify additional data collection as necessary.

For problems reported by other components, refer to their messages and diagnostics guides; see Cluster information resources, on page 18. Also see Data Collection, on page 77 for information on data collection for a given component.

Hardware problems

This topic provides information about hardware problems. It is intended for readers who do not have a symptom in hand and so cannot begin with the procedure in Start Here, on page 23, but who want to check whether there might be a hardware problem. Look for problems identified in:

- Service Focal Point: see Problem reported in Service Focal Point, on page 29.
- TEAL: see Problem reported in TEAL, on page 24.
- Error logs, which could be the AIX error log or the Linux syslog. Typically, hardware problems that are found in error logs will also be found in TEAL or Service Focal Point. If the problem is not found in TEAL or Service Focal Point, call your next level of support with the log information.
- LEDs: if hardware has a particularly severe problem that impacts its ability to report a problem, sometimes LEDs are the only method for diagnosis. LED information is available in the 9125-F2C hardware service guide in InfoCenter.

Software problems

This topic provides information about software problems. It is intended for readers who do not have a symptom in hand and so cannot begin with the procedure in Start Here, on page 23, but who want to check whether there might be a software problem.

- Software problems associated with ISNM: call your next level of support. Also see Data Collection, on page 77 for information on data collection for a given component.
- Software problems associated with TEAL: see Problem with HMC to TEAL reporting path, on page 37, or call your next level of support and describe the symptoms that you are seeing. Also see Data Collection, on page 77 for information on data collection for a given component.
- Software problems associated with xCAT: see the xCAT documentation on sourceforge.net.
- All other software problems: use the individual software component's messages and diagnostics guides; see Cluster information resources, on page 18. Also see Data Collection, on page 77 for information on data collection for a given component.

2.3.4 Node will not boot

There are several reasons why a node will not boot:

- A hardware issue with the node hardware.
- If it is a diskless node: an issue with the service node or the disks that serve and contain the operating system for the node, or a severe HFI network issue preventing the service node from reaching the diskless node.
- If it is a diskful node, the problem is probably internal to the node itself.

First check TEAL on the EMS, or Service Focal Point on the HMC that owns the node hardware, for a hardware problem reported in the CEC drawer containing the node. If a hardware problem is found and it was not addressed previously, use the appropriate procedure in Problem reported in TEAL, on page 24 or Problem reported in Service Focal Point, on page 29. Recall that if there is a hardware problem, it may need an Availability Plus recovery action taken against it; see Power 775 Availability Plus actions, on page 61.

If the node is diskful, and there has been no associated problem reported to TEAL or Service Focal Point, check the boot progress of the node and use the server and operating system documentation to determine whether the boot progress indicates an issue.

If the node is diskless, and there is no apparent hardware problem, check the service node and the HFI network path to the node; see xCAT problems, on page 36, and HFI Network Problems, on page 33.

2.3.5 Application crashing diagnosis

If an application is crashing, check the following:

1. Check TEAL for alerts around the time of the problem, and take recommended service actions.
   - Pay attention first to any LoadLeveler events that are relevant to the job that crashed.
   - Look for any events related to the resources that the job was running on at the time the application crashed.
   - Look for any GPFS-reported problems around the time of the crash.
   Note: For any alerts reported in TEAL, see Problem reported in TEAL, on page 24.
2. Check for log entries on the nodes on which the application was running around the time of the problem.
   For any problems found in particular components, use the individual software components' messages and diagnostics guides. See Cluster software and firmware information resources, on page 21.
3. Using ISNM, perform a network health check as outlined in the High performance clustering using the 9125-F2C Management Guide, in the HFI network health checks section. If a problem is found, use the procedures in HFI Network Problems, on page 33.
4. Use the individual software components' diagnostic techniques. See Cluster software and firmware information resources, on page 21.
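The TEAL and ISNM checks above can be grouped into a single sweep. A hedged sketch; the tllsalert path is the one shown earlier in this guide, lsnwdownhw is the ISNM command covered under HFI Network Problems, and the function name is illustrative:

```shell
# Hedged sketch: run the post-crash checks in one pass.  Each tool is
# treated as optional: if it is not installed on the node where this runs,
# note that and move on rather than fail.
crash_sweep() {
  echo "== TEAL alerts =="
  /opt/teal/bin/tllsalert 2>/dev/null || echo "(tllsalert not available here)"
  echo "== HFI hardware reported down by ISNM =="
  lsnwdownhw 2>/dev/null || echo "(lsnwdownhw not available here)"
}
```

Run this on the EMS, where both TEAL and the ISNM commands live, and review each section against the steps above.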

2.3.6 Performance Problem Diagnosis

If performance problems have been reported, check the following:

1. Check TEAL for alerts around the time of the problem, and take recommended service actions.
   - Pay attention first to any LoadLeveler events that are relevant to the affected job.
   - Look for any events related to the resources that the job was running on at the time of the problem.
   - Look for any GPFS-reported problems around the time of the problem.
   Note: For any alerts reported in TEAL, see Problem reported in TEAL, on page 24.
2. Verify that the job experiencing a performance problem is not running on a node with deconfigured resources. If it is, either ensure that the job parameters to LoadLeveler do not allow running on nodes with deconfigured resources, or remove the node from the LoadLeveler job pool. Use:
   rinv [noderange] deconfig

The following example shows a case with no deconfigured resources. Any deconfigured resource has a location code beyond -P1, such as U78A N005-P1-R1.

-> rinv cec08 deconfig
cec08: Deconfigured resources
cec08: Location_code RID Call_Out_Method Call_Out_Hardware_State TYPE
cec08: U78A N005-P1 800

The following example shows two deconfigured resources in a CEC: P1-R39 and P1-R8. For location information, see Hardware Service Locations for 9125-F2C, on page 75.

-> rinv cec01 deconfig
cec01: Deconfigured resources
cec01: Location_code RID Call_Out_Method Call_Out_Hardware_State TYPE
cec01: U78A B001-P1 800
cec01: U78A B001-P1-R39 101f SYSTEM DECONFIGURED L2CTLR_FU
cec01: U78A B001-P1-R8 SYSTEM DECONFIGURED Invalid_FU

HFI Network Problems

HFI network problems can be reported in several ways:
- TEAL alerts: see Problem reported in TEAL, on page 24.
- Serviceable events on the HMC: see Problem reported in Service Focal Point, on page 29.
- ISNM command results indicating problems: see ISNM command results indicating problems, on page 34.
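Returning to the rinv deconfig check in 2.3.6: when many CECs must be checked, the output can be scanned mechanically. A hedged sketch; the function name is illustrative, and DECONFIGURED is the state string shown in the example output in that section:

```shell
# Hedged sketch: count DECONFIGURED entries in `rinv <cec> deconfig` output
# read from stdin.  A count of 0 means no deconfigured resources.
count_deconfigured() {
  grep -c 'DECONFIGURED' || true   # grep -c prints 0 when nothing matches
}

# Example: rinv cec01 deconfig | count_deconfigured
```

A nonzero count on a node in the LoadLeveler job pool is a candidate explanation for a performance problem, per step 2 of 2.3.6.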

ISNM command results indicating problems

For details on ISNM commands, see the Command references section of the High performance clustering using the 9125-F2C Management Guide. Sub-sections follow for each command and how to resolve problems reported by it, in alphabetical order by command.

lsnwcomponents problems

The lsnwcomponents command can indicate problems with ISNM recognizing hardware components. The results of the command are a list of BPAs and FSPs indicating which are primary and which are backup, as well as IP addressing, location and MTMS information.

1. If a component is missing, check for Ethernet problems between the EMS and the component.
2. If multiple components are missing from the results, look for a pattern. If they are all in the same frame, or all in the same subnet, this should point you to a root cause:
   - No components can be seen: this is likely a problem in the EMS connection to the cluster management network.
   - All components in a frame: this is likely a problem in the network between the BPA and the EMS.
   - All FSPs in the same frame, but the BPA is visible: this is likely a problem in the Ethernet between the BPA and the FSPs.
   - All of the components in a subnet cannot be seen: this is likely an issue with the network hardware in that subnet.
3. If a component's information has changed, consider the following:
   - The relationship between primary and backup can change. Typically, this is not a problem, as long as there is a primary. If the relationship changes often, there may be a problem waiting to happen; in that case, contact your next level of support.
   - If the IP addressing has changed, does it make sense based on any management network configuration changes?
   - If the MTMS information has changed, was there a service action on the CEC drawer or BPA containing the component? The MTMS should not have changed, even after a service action. If it has changed, go to the next step.
4.
If the component's information should not have changed, do the following:
   - If the MTMS has changed, contact your next level of support and indicate that the MTMS has changed.
   - If the addressing has changed, check for problems with DHCP.
   - If the location has changed, contact your next level of support.

lsnwconfig problems

A problem revealed in lsnwconfig is most likely a configuration or installation issue. To resolve the problem, set the problem configuration parameter to the correct value; see the chnwconfig command reference in the Command references section of the High performance clustering using the 9125-F2C Management Guide.

lsnwdownhw problems

For lsnwdownhw problems, do the following.

Note: If any service action is deferred as part of Availability Plus, record this for future reference. For details on the status returned on links, see the lsnwlinkinfo command reference in the Command references section of the High performance clustering using the 9125-F2C Management Guide.

1. Check your records to determine whether the problem has been deferred as part of Availability Plus. If it has been, you may ignore the problem at this time.
2. Check TEAL for any alerts reported against the location that has the problem. Any format for tllsalert other than the brief format will have the required information.
   - For ISR, HFI and LL-link problems, look at the event_loc field to match the hub or link that has a problem. If no TEAL alert is found, record the results of lsnwdownhw and call your next level of service.
   - For D-link and LR-link problems, start with the event_loc field to match the link that has a problem; keep in mind that TEAL uses LD for D-links, and ISNM uses D. Because links have neighbors, you should also match the location information in the reason field. The reason field does not use the hardware logical location code format used by ISNM; it spells out the information, such as frame FR001, cage CG03, and so on. If no TEAL alert is found, record the results of lsnwdownhw and call your next level of service.

lsnwexpnbrs problems

If lsnwexpnbrs does not return the proper neighbors, it is best to run lsnwmiswires and then follow the procedure for problems reported by lsnwmiswires (see lsnwmiswires problems, on page 36).

lsnwlinkinfo problems

For details on the status returned by lsnwlinkinfo, see the lsnwlinkinfo command reference in the Command references section of the High performance clustering using the 9125-F2C Management Guide.

Note: If any service action is deferred as part of Availability Plus, record this for future reference.

1.
Check your records to determine whether the problem has been deferred as part of Availability Plus. If it has been, you may ignore the problem at this time.
2. Check TEAL for any alerts reported against the location that has the problem. Any format for tllsalert other than the brief format will have the required information.
   - For LL-link problems, look at the event_loc field to match the hub or link that has a problem. If no TEAL alert is found, record the results of lsnwlinkinfo and call your next level of service.
   - For D-link and LR-link problems, start with the event_loc field to match the link that has a problem; keep in mind that TEAL uses LD for D-links, and ISNM uses D. Because links have neighbors, you should also match the location information in the reason field. The reason field does not use the hardware logical location code format used by ISNM; it spells out the information, such as frame FR001, cage CG03, and so on. If no TEAL alert is found, record the results of lsnwlinkinfo and call your next level of service.

lsnwloc problems

Refer to the status descriptions for lsnwloc in the Command references section of the High performance clustering

using the 9125-F2C Management Guide.

lsnwmiswires problems

To resolve problems reported by lsnwmiswires, see the procedure to resolve network miswires in the High performance clustering using the 9125-F2C Management Guide.

lsnwtopo problems

If there is a problem in the results of lsnwtopo, do the following:

1. If lsnwtopo, lsnwtopo -C and lsnwtopo -A all return the same, but incorrect, topology, or lsnwtopo returns the incorrect topology, the Cluster Database must be updated and CNM restarted:
   chdef -t site -o mastersite topology=[topology string]
   Then restart CNM.
2. If lsnwtopo -C does not match lsnwtopo, restart CNM and query lsnwtopo -C again. If the problem persists, call your next level of support.
3. If lsnwtopo -A returns one or more incorrect topologies, use chnwsvrconfig to download the topology from CNM to the problem FSP(s) again. For details on using chnwsvrconfig to target one or more FSPs, see the chnwsvrconfig command reference in the Command references section of the High performance clustering using the 9125-F2C Management Guide.

Checking routing mode

To check the HFI routing mode, read the rtmode files, one per interface:

cat /sys/class/net/hf?/rtmode
0x0
0x0
0x0
0x0

The values map to routing modes as follows:
0 ---> direct
1 ---> sw indirect
2 ---> multidirect
3 ---> hw indirect

xCAT problems

xCAT problems described here are broken into three categories in the following table:

Symptom/Problem: General xCAT problems
  Topic: General xcat problems, on page 37
Symptom/Problem: Operating system deployment problems and xCAT
  Topic: Operating system deployment problems and xcat, on page 37

Symptom/Problem: An EMS has failed and requires failover recovery actions
  Topic: EMS Failover Actions, on page 37

General xcat problems

For information on debugging general xCAT problems and service node booting problems, see the xCAT documentation on sourceforge.net. The article includes information on what to check to isolate a problem with xCAT. It also includes information on operating system deployment problems, including possible issues with service nodes.

Operating system deployment problems and xcat

For information on debugging problems with deploying operating systems to nodes using xCAT, see the xCAT documentation page title=debugging_xcat_problems#how_to_debug_a_general_os_deployment_process.

EMS Failover Actions

For the most up-to-date information on failing over the EMS, see the xCAT documentation on sourceforge.net.

TEAL Service Procedures

The following sub-sections are used for resolving issues with TEAL. If you cannot find an appropriate sub-section, contact your next level of support with the problem symptoms that you are seeing. Also see Data Collection, on page 77 for information on data collection for a given component.

Problem with HMC to TEAL reporting path

If there is a problem with the HMC to TEAL reporting path, perform the following procedure to (re)initialize the proper RMC functions for this path.

1. Deconfigure rmcmon:
   mondecfg rmcmon hmc -r
   mondecfg rmcmon hmc
2. Reconfigure rmcmon:
   moncfg rmcmon hmc -r

   moncfg rmcmon hmc
3. Start rmcmon:
   monstart rmcmon hmc
4. Set up the condition/response:
   startcondresp "AllServiceableEvents_HB" "TealLogSfpEvent"

HFI Network Alerts and Serviceable Events

HFI Network Alerts and Serviceable Events Overview

HFI network alerts and serviceable events are used to identify HFI network problems. Alerts are found in the TEAL alert database table, while serviceable events are found on an HMC under Manage Serviceable Events. The TEAL alerts are primarily used by administrators and operators, while the serviceable events are meant for System Service Representatives (SSRs).

Any hardware-oriented event should be found in both places. A software issue might be found only in the alert database. This depends on whether the software issue was found while processing a hardware event and caused the analysis of the hardware event to be compromised, in which case either a single alert with a corresponding serviceable event is generated and placed in both places, or the problem may require two alerts to be generated (one hardware and one software) with one having a corresponding serviceable event. In rare cases, not enough information is known about the affected hardware and a single software event is logged only as an alert.

HFI network event codes are 8 hexadecimal digits starting with BD; for example, one BD event represents a Port Lane Width Change on a D-link port. If you recognize an HFI network event in a TEAL alert, start with Acting on an HFI Network Alert, on page 38. If you recognize an HFI network event as a serviceable event on an HMC, start with Acting on a HFI Network Serviceable Event.

Acting on an HFI Network Alert

You are here because you have recognized an HFI Network Alert reported in the TEAL alert database. To resolve the issue, first record the following information:

- The alert_id gives a code that indicates a specific alert.
  The alert_id matches the system reference code (SRC) reported as a serviceable event.
- The reason provides a short description of the problem.
- The recommendation provides a brief description of how to respond to the problem.
- The event_loc provides location information for the component that reported the problem.
- The rec_id is used to reference a specific alert.
- The raw_data is very specific to HFI network alerts and provides more important information about the alert. See HFI Network Alert Format:, on page 39.

Once you have recorded the information, review it in the following order:
- Recommendation

- fru_list from the raw_data. Pay particular attention to the first FRU, which is an isolation procedure. The isolation procedures are found in HFI Isolation Procedures:, on page 43, as part of HFI Serviceable Event Procedures, on page 43. Other procedures may be reported in the form of symbolic procedures. These are found in HFI Symbolic Procedures:, on page 50, as part of HFI Serviceable Event Procedures, on page 43.
- alert_id

Typically, HFI network alerts require hardware service to be involved. The administrator should open a problem to IBM hardware service as instructed in the recommendation. In addition, the administrator should run whatever fail-in-place procedures are required for recovery from the problem. Recommendations may refer to hardware maintenance, diagnostic, or service procedures. If they do, these are intended to be informational for all but the hardware service team. There may be procedures that require the administrator to contact IBM software service. These are typically configuration or software problems. Again, the recommendation should direct you.

HFI Network Alert Format:

Besides the typical key fields for alerts (event ID, reason, recommendation, creation_time, and so on), the alert raw data contains service information. The following is the list of service information, with the associated field names in parentheses:
- The FRU list (fru_list) is used to indicate what is involved in repairing the problem. It can be a mixture of hardware and diagnostic procedures.
- The neighbor location (nbr_loc and nbr_type) is used to indicate the neighbor of the location that detected the event(s) associated with the alert. This is not always applicable. The nbr_type is typically H for hardware.
- The power controlling enclosure MTMS (pwr_enc) is the Machine Type-Model-Serial number (MTMS) of the enclosure that has the power source for the device that reported the problem. In all cases, this should be the BPA in the frame.
- The extended error data location (eed_loc) is used to indicate where more in-depth logs and traces can be found. These files are used for engineering debug.
- The logic controlling enclosure MTMS (encl_mtms) is the MTMS of the enclosure that contains the device that reported the problem.

Each field is introduced with the field name followed by a colon (typical JSON format). Large fields of raw data are contained within braces ( { and } ), with ordered lists of sub-fields in braces as well. For example, the following illustrates the raw_data format:

raw_data : {"fru_list":"{ HFI_DDG,Isolation Procedure,,,, }, { HFI_CAB,Symbolic Procedure,U78A P1-T17-T4,,, }, { CBLCONT,Symbolic Procedure,U78A P1-T17-T6,,, }, { 45D1111,FRU,U78A P1-R3,YL55555,ABC123,TRMD }, { 45D1111,FRU,U78A P1-R2,YL55666,ABC123,TRMD }","nbr_loc":"BB03-FR007-SN017-DR0-HB2-LD15","nbr_typ":"H","pwr_enc":"BPCF007","eed_loc":"c250mgrs14:/var/opt/isnm/cnm/log","encl_mtms":"9125-F2C/P7IH149"}

The fru_list is an ordered list of FRUs that are associated with the problem, as well as a reference to any procedures associated with isolating the problem (see HFI Serviceable Event Procedures, on page 43). Each FRU is contained within braces, and the FRUs are comma-separated. The FRUs themselves are comprised of several comma-separated fields:
- Part number or procedure name.
- FRU Class, which identifies whether this is an actual FRU, an isolation procedure used to isolate the problem, or a symbolic procedure used to represent an actual FRU that has no VPD associated with it. For procedures, see HFI Serviceable Event Procedures, on page 43.
- FRU Location, which uses the typical IBM POWER architecture service or physical location codes; see HFI Network FRU Locations:, on page 42.
- FRU Serial Number of the part.
- FRU ECID, the engineering change ID for the part, which helps in determining a minor revision of a part that did not result in a part number change.
- FRU CCIN, a custom identification number used to identify the part.
Therefore, the above example breaks down into:

Field       Value                             Comment
fru_list    (see next table)
nbr_loc     BB03-FR007-SN017-DR0-HB2-LD15
nbr_type    H                                 Hardware location
pwr_enc     BPCF007                           BPA MTMS
eed_loc     c250mgrs14:/var/opt/isnm/cnm/log  Hostname and path to the directory containing more detailed logs and traces required for engineering debug
encl_mtms   9125-F2C/P7IH149                  Enclosure MTMS for the containing 9125-F2C system
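To inspect such an alert programmatically, the fru_list string can be split into its individual FRU entries with standard text tools. This is a minimal sketch: the sample string is an abbreviated version of the example above, and list_frus is a hypothetical helper, not a TEAL command.

```shell
# Sketch: split a raw_data fru_list string into one FRU entry per line.
# Field order within each entry is part,class,location,serial,ecid,ccin,
# as described in the field list above.
fru_list='{ HFI_DDG,Isolation Procedure,,,, }, { HFI_CAB,Symbolic Procedure,U78A P1-T17-T4,,, }, { 45D1111,FRU,U78A P1-R3,YL55555,ABC123,TRMD }'

# Emit one FRU per line with the surrounding braces and padding removed.
list_frus() {
    printf '%s\n' "$1" | grep -o '{[^}]*}' | sed -e 's/^{ *//' -e 's/ *}$//'
}

list_frus "$fru_list"
# prints:
#   HFI_DDG,Isolation Procedure,,,,
#   HFI_CAB,Symbolic Procedure,U78A P1-T17-T4,,,
#   45D1111,FRU,U78A P1-R3,YL55555,ABC123,TRMD
```

The first entry on the resulting list is the isolation or symbolic procedure to act on first, as described above.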

The FRU list is as follows:

Part Number or Procedure   FRU Class            Location Code    Serial Number   ECID     CCIN   Comments
HFI_DDG                    Isolation Procedure                                                   A procedure with no location or other VPD
HFI_CAB                    Symbolic Procedure   U78A P1-T17-T4                                   A cable connector with no VPD, just a location
CBLCONT                    Symbolic Procedure   U78A P1-T17-T6                                   The other end of the cable
45D1111                    FRU                  U78A P1-R3       YL55555         ABC123   TRMD   A hub module with part VPD used for identifying a replacement part
45D1111                    FRU                  U78A P1-R2       YL55666         ABC123   TRMD   The hub module on the other end of the cable

Note that the above is just an example, and actual values may vary.

Acting on an HFI Network Serviceable Event

An HFI network serviceable event is reported to a designated HMC or its backup. In addition, there will be a TEAL alert. TEAL alerts for HFI network events are discussed in Acting on an HFI Network Alert:, on page 38. In order to resolve this issue, first record the following information:
- The SRC or refcode
- The problem description, which provides a short description of the problem, including location information
- The FRU list

Once you have recorded the information, review it in the following order:
- FRU list. Pay particular attention to the first FRU, which is an isolation procedure. The isolation procedures are found in HFI Isolation Procedures:, on page 43, as part of HFI Serviceable Event Procedures, on page 43. Other procedures may be reported in the form of symbolic procedures. These are found in HFI Symbolic Procedures:, on page 50, as part of HFI Serviceable Event Procedures, on page 43.
- SRC or refcode

Typically, HFI network alerts require hardware service to be involved. The administrator should open a problem to IBM hardware service as instructed in the recommendation. In addition, the administrator should run whatever fail-in-place procedures are required for recovery from the problem. Recommendations may refer to hardware maintenance, diagnostic, or service procedures. If they do, these are intended to be informational for all but the hardware service team.

There may be procedures that require the administrator to contact IBM software service. These are typically configuration or software problems. Again, the recommendation should direct you.

HFI Network Serviceable Event Format:

The serviceable event format follows the typical serviceable event format architected for all Power systems. An important point is that location information is embedded into the problem description to make it easier to find the parts to be repaired and to understand the impact to applications. The key piece of information is the frame, cage, supernode, and drawer information. For example:

D Link Port Lane Width Change between frame FR007 supernode SN032 drawer DR0 hub HB1 port LD15 and frame FR007 supernode SN017 drawer DR0 hub HB2 port LD15 (D Link Port Lane Width Change)

The hub and port information is most often used for engineering debug purposes. When troubleshooting an HFI network serviceable event, perform the following:
- Bring up the list of events on the primary Event Analysis (EA) HMC for the network manager. You can determine this by logging on to the EMS and typing the following command. A list of HMCs is presented, with the primary first and backups following.
  [c250mgrs14][/opt/teal/ibm/isnm]> tabdump site | grep ea_primary
  "ea_primary_hmc","c250hmc05.ppd.pok.ibm.com",,
- Record the SRC or reference code, which is 8 digits and begins with BD.
- Record the problem description.
- If there is location information in the problem description, use it to find the FRUs. The most important information is the frame, cage, supernode, and drawer.
- Select the problem and then open it.
- Look at the FRU list.
- If there is an isolation procedure at the head of the FRU list, use the corresponding procedure in HFI Serviceable Event Procedures, on page 43. If there is no procedure listed, replace the FRUs in the order listed.
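The EA HMC lookup shown above can be wrapped so that it returns just the hostname. This is a sketch assuming the tabdump site output format shown in the example; primary_ea_hmc is a hypothetical helper name, not an xCAT command.

```shell
# Sketch: find the primary Event Analysis HMC hostname from the xCAT site
# table on the EMS. tabdump emits CSV with quoted fields, for example:
#   "ea_primary_hmc","c250hmc05.ppd.pok.ibm.com",,
primary_ea_hmc() {
    tabdump site | grep '^"ea_primary_hmc"' | cut -d, -f2 | tr -d '"'
}
```

You could then, for example, ssh to $(primary_ea_hmc) to review Manage Serviceable Events.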
If the FRU is an A+ resource, the SSR should call their next level of support to inform them and follow their instructions. Any FRU that has an R# (like R2) in its location code is an A+ resource.

When the problem has been fixed:
- Close the problem on the primary EA HMC using Manage Serviceable Events.
- With permission from the system administrator, close out the corresponding alert on the EMS using /opt/teal/bin/tlchalert -s close -i [alertid]

HFI Network FRU Locations:

There are several locations that are associated with HFI network FRU locations. In all cases, they begin with an enclosure's unit location and the main planar board for the 9125-F2C system: U78A P1, where U78A9.001 is the 9125-F2C enclosure, followed by a unique 7-digit

enclosure serial number. P1 is the planar. The unit location is often abbreviated as U* for convenience in examples. When the VPD for the enclosure has not been properly read, the unit location can also be represented as either U####.###.####### or Uffff.iii.sssssss. Beneath the unit location and planar, the individual FRU resources are found:
- Cables are represented by two locations and symbolic procedures (see HFI Serviceable Event Procedures, on page 43), because a cable is comprised of two connectors attaching to the bulkheads of two 9125-F2C systems. The bulkhead connectors are not labeled, but have a regular pattern.
- D-link ports are defined by two connector location identifiers using T to identify them. Two location identifiers are used because the D-link port connectors have an 8-connector assembly that groups connectors to be plugged into individual ports. For example, U*-P1-T1-T2.
- LR-link ports are defined by a single connector location identifier because they are grouped into one large connector assembly. For example, U*-P1-T9, where T9 is the location to which the cable assembly is plugged.
- HFI_CAB is a symbolic procedure that represents one end of a cable. It is always followed in the FRU list by the other end of the cable, which is represented by the symbolic procedure CBLCONT. See HFI Serviceable Event Procedures, on page 43.
- Hub modules are represented by a location R1 through R8. For example, U*-P1-R4.
- Optical modules are represented as an entity on the hub module and are numbered R1 through R40. In the case of D-link ports, there is a single port per optical module. In the case of LR-link ports, there are two resources per optical module. For example, U*-P1-R5-R13.

Note that all service/physical locations begin their instance numbering at one rather than at zero.

HFI Serviceable Event Procedures

Serviceable event procedures may be found in the FRU list in serviceable events and TEAL alerts.
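Because hub and optical modules carry the R-numbered locations described above, the A+ resource rule given earlier (any FRU with an R# component in its location code) can be checked mechanically. A sketch only; is_aplus_resource is a hypothetical helper, not a shipped tool.

```shell
# Sketch: per this guide, any FRU location containing an R<number>
# component (hub module R1-R8, or optical module R1-R40 on a hub) is a
# Power 775 Availability Plus (A+) resource.
is_aplus_resource() {
    case "$1" in
        *-R[0-9]*) return 0 ;;  # e.g. U*-P1-R4 or U*-P1-R5-R13
        *)         return 1 ;;
    esac
}

is_aplus_resource "U*-P1-R5-R13" && echo "A+ resource"   # optical module
is_aplus_resource "U*-P1-T1-T2"  || echo "not A+"        # D-link cable port
```

An SSR could use such a check to decide whether to involve the next level of support before replacing a FRU.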
There are two types of procedures used in the FRU list:
- An isolation procedure is used to define steps to follow in order to isolate a problem to root cause and repair it.
- A symbolic procedure represents a physical FRU that either has no VPD defined for it (such as a cable), or for which the VPD is unavailable for one reason or another.

While any subsystem or domain analysis may yield symbolic or isolation procedures, the HFI network event procedure names always begin with HFI. See the following sub-sections for descriptions of the various procedures.

HFI Isolation Procedures:

HFI isolation procedures are intended to step you through detailed isolation. The following is a list of the isolation procedures; following the list are the details for each procedure:

HFI_DDG: This procedure uses network management link diagnostics to isolate to the root cause of a D-link failure. If isolation is not achieved, the SSR is directed to replace the parts according to the remainder of the FRU list. However, with Power 775 Availability Plus, the SSR may have to contact the next level of support to replace any link part other than a cable.
HFI_LDG: This procedure uses network management link diagnostics to isolate to the root cause of an LR-link failure. If isolation is not achieved, the SSR is directed to replace the parts according to the remainder of the FRU list. However, with Power 775 Availability Plus, the SSR may have to contact the next level of support to replace any link part other than a cable.
HFI_NET: This procedure is used when there are network events across multiple locations and engineering support is required to help isolate the problem.
HFIGCTR: This procedure is used when a global counter issue must be isolated.
HFI_LNM: This procedure indicates a code problem within the ISNM code running on a service processor, or with its interface with CNM on the EMS.
HFINSFP: This procedure indicates that no HMC was found to which to write a serviceable event. This can only be populated as a TEAL alert.
HFINNBR: This procedure is used when a neighbor was not reported as part of an event and a neighbor is required for proper root cause analysis.
HFI_IDR: This procedure is used when many network link events are reported and are all associated with the same drawer, without any single event indicating that there is a problem with that drawer, or some user action has been taken against it.
HFI_ISN: This procedure is used when many network link events are reported and are all associated with the same supernode, without any single event indicating that there is a problem with that supernode, or some user action has been taken against it.
HFI_IFR: This procedure is used when many network link events are reported and are all associated with the same frame, without any single event indicating that there is a problem with that frame, or some user action has been taken against it.
HFI_ICL: This procedure is used when many network link events are associated with multiple frames, which would suggest a cluster-wide event, without any single event indicating that there is a problem with a frame, or some user action has been taken against it.
HFI_UFC: This is a code problem where an unknown FRU Class was passed by the alert metadata to TEAL. The problem is in CNM_GEAR_alert_metadata.xml.
HFI_COM: This procedure is used when a common part could not be determined for a complex combination of events.
HFILONG: This is a code problem where a procedure name passed to TEAL is too long. The problem is in CNM_GEAR_alert_metadata.xml.

HFI_DDG

This procedure is used to isolate a problem with a D-link.
- Record the SRC or refcode or alert_id.
- Record the problem description or reason.
- Record the FRU list, particularly the HFI_CAB and CBLCONT FRUs.
- Before proceeding, follow the Power 775 Availability Plus procedures to determine if action is required. See Power 775 Availability Plus actions, on page 61.
- Log on to the EMS and run /usr/bin/nwlinkdiag. You will require a wrap device to perform link diagnostics. Use the information in the problem description to identify which ports to test.
- Go to the side of the link specified in the symbolic procedure FRU HFI_CAB, remove the D-link cable from the connector, and replace it with the wrap device. Then, run nwlinkdiag.
  - If a problem is found, call your next level of support and provide them with the problem details above, because the problem will probably have to be resolved by a special team trained to replace certain CEC hardware.
  - If no problem is found, remove the wrap device and reconnect the D-link cable.
- Go to the other side of the link, as specified in the symbolic procedure FRU CBLCONT, remove the D-link cable from the connector, and replace it with the wrap device. Then, run nwlinkdiag.
  - If a problem is found, call your next level of support and provide them with the problem details above, because the problem will probably have to be resolved by a special team trained to replace certain CEC hardware.
  - If no problem is found, proceed to the next step.
- If problem isolation is not accomplished with nwlinkdiag, the problem is most likely in the cable or its connectors:
  - Check for damage to the cable or connectors. Proceed with caution if the cable is attached at one end, and use proper safety in the presence of the laser driving the cable signals.
  - Determine if the cable needs to be cleaned.
  - If there appear to be no obvious problems with the cable, replace it.
- If the problem still persists, call your next level of support and provide them with the problem details above, because the problem will probably have to be resolved by a special team trained to replace certain CEC hardware.
- After the repair, check the health of the network to verify that the link is back up and operational and that no other problems were caused. Use the HFI network health check procedure documented in High performance clustering using the 9125-F2C Management Guide.

HFI_LDG

This procedure is used to isolate a problem with an LR-link.
- Record the SRC or refcode or alert_id.
- Record the problem description or reason.
- Record the FRU list, particularly the HFI_CAB and CBLCONT FRUs.
- Before proceeding, follow the Power 775 Availability Plus procedures to determine if action is required. See Power 775 Availability Plus actions, on page 61.

- If action is required, inform the system administrator that LR-link diagnostics impact an entire supernode. If necessary, schedule this activity at a later date.
- Log on to the EMS and run /usr/bin/nwlinkdiag. You will require a wrap device to perform link diagnostics. Use the information in the problem description to identify which ports to test.
- Go to the side of the link specified in the symbolic procedure FRU HFI_CAB, remove the LR-link cable from the connector, and replace it with the wrap device. Then, run nwlinkdiag.
  - If a problem is found, call your next level of support and provide them with the problem details above, because the problem will probably have to be resolved by a special team trained to replace certain CEC hardware.
  - If no problem is found, remove the wrap device and reconnect the LR-link cable.
- Go to the other side of the link, as specified in the symbolic procedure FRU CBLCONT, remove the LR-link cable from the connector, and replace it with the wrap device. Then, run nwlinkdiag.
  - If a problem is found, call your next level of support and provide them with the problem details above, because the problem will probably have to be resolved by a special team trained to replace certain CEC hardware.
  - If no problem is found, proceed to the next step.
- If problem isolation is not accomplished with nwlinkdiag, the problem is most likely in the cable or its connectors:
  - Check for damage to the cable or connectors. Proceed with caution if the cable is attached at one end, and use proper safety in the presence of the laser driving the cable signals.
  - Determine if the cable needs to be cleaned.
  - If there appear to be no obvious problems with the cable, replace it.
- If the problem still persists, call your next level of support and provide them with the problem details above, because the problem will probably have to be resolved by a special team trained to replace certain CEC hardware.
- After the repair, check the health of the network to verify that the link is back up and operational and that no other problems were caused. Use the HFI network health check procedure documented in High performance clustering using the 9125-F2C Management Guide.

HFI_NET

This procedure is used when a network-wide problem has occurred.
- Record the SRC or refcode or alert_id.
- Record the problem description or reason.
- Record the FRU list.
- Call your next level of support and provide them with the problem details recorded above.

HFIGCTR

This procedure is used when a global counter problem has occurred.
- Record the SRC or refcode or alert_id.
- Record the problem description or reason.
- Record the FRU list.

- Call your next level of support and provide them with the problem details recorded above.

HFI_LNM

This procedure is used when a problem has occurred with the ISNM code running in a service processor, or in the interface between a service processor and the EMS.
- Record the SRC or refcode or alert_id.
- Record the problem description or reason.
- Record the FRU list.
- Call your next level of support and provide them with the problem details recorded above.

HFINSFP

This procedure is associated with a problem interfacing with an Event Analysis HMC. Perform the following procedure:
- Record the SRC or refcode or alert_id.
- Record the problem description or reason.
- Record the FRU list.
- Determine if the HMC that is configured to be the Event Analysis HMC for HFI network events is down. Do this also for the backup Event Analysis HMC.
- If the HMCs are up, check the service network connection to them.
- Call your next level of support and provide them with the problem details recorded above.

HFINNBR

This indicates that a neighbor should have been reported as part of the event used to generate the alert and serviceable event. This is a critical piece of information for properly diagnosing the link problem. Therefore, call your next level of hardware support to obtain guidance in isolating the problem. It is possible that the software service organization may also be engaged to determine why the neighbor is missing from the event. While waiting for guidance from your next level of support, you can determine the expected neighbor by using the following procedure, which involves a query on the EMS to ISNM:
- Get the location information for the reporting device from either the serviceable event description or the TEAL alert reason.
- Form the ISNM location information from the description by placing a dash between each location element, as in the following example:
  Description/reason: D Link Port Lane Width Change between frame FR008 cage CG04 (supernode SN000 drawer DR1) hub HB7 port LD11 and frame FRxxx cage CGxx (supernode SNxxx drawer DRx) hub HBx port LDxx (D Link Port Lane Width Change)
  Location: FR008-CG04-SN000-DR1-HB7-D11
- Log on to the EMS.
- Run: /usr/bin/lsnwexpnbrs | grep "Loc: [location]"
  For example:
  > lsnwexpnbrs | grep "Loc: FR008-CG04-SN000-DR1-HB7-D11"
  Loc: FR008-CG04-SN000-DR1-HB7-D11 ExpNbr: FR008-CG11-SN002-DR0-HB7-D14 ActualNbr: FR008-CG11-SN002-DR0-HB7-D14
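The description-to-location conversion shown above can be sketched with sed. This is illustrative only: it follows the guide's example, in which "port LD11" maps to the "-D11" suffix, handles one side of the link at a time, and uses a hypothetical helper name (to_isnm_location); verify the exact port naming against your ISNM level.

```shell
# Sketch: turn the human-readable location phrase from a serviceable-event
# description into the dash-separated ISNM location used by lsnwexpnbrs.
to_isnm_location() {
    printf '%s\n' "$1" | sed -e 's/[()]//g' -e 's/frame //' -e 's/ cage /-/' -e 's/ supernode /-/' -e 's/ drawer /-/' -e 's/ hub /-/' -e 's/ port L\{0,1\}/-/'
}

desc='frame FR008 cage CG04 (supernode SN000 drawer DR1) hub HB7 port LD11'
to_isnm_location "$desc"
# prints FR008-CG04-SN000-DR1-HB7-D11
```

On the EMS you could then run /usr/bin/lsnwexpnbrs | grep "Loc: $(to_isnm_location "$desc")" to find the expected neighbor.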

The expected neighbor is FR008-CG11-SN002-DR0-HB7-D14. For details on HFI network locations, see HFI Network Locations, on page 52.

HFI_IDR

This indicates that a drawer has a large number of link events associated with it in a short period of time. This is typically caused by user action or some sort of power or thermal event. Most such events are reported and taken into account, and a serviceable event or alert is not reported.
- Record the SRC or refcode.
- Record the problem description.
- Record the FRU list.
- Look for a serviceable event on the HMC that manages the 9125-F2C that makes up this drawer. If there is such an event, close out the serviceable event and ask the system administrator to close out the alert associated with this event; then, stop executing this procedure.
- Otherwise, work with the system administrator to determine if someone took an action to bring down the drawer. Location information from the problem description may be the best method to communicate location to the system administrator.
- If the drawer-level network outage is explainable, close out the serviceable event and ask the system administrator to close out the alert.
- If the event cannot be explained, call your next level of support.

HFI_ISN

This indicates that a supernode has a large number of link events associated with it in a short period of time. This is typically caused by user action or some sort of power or thermal event. Most such events are reported and taken into account, and a serviceable event or alert is not reported.
- Record the SRC or refcode.
- Record the problem description.
- Record the FRU list.
- Look for a serviceable event on the HMC that manages the 9125-F2C systems that make up this supernode. If there is such an event, close out the serviceable event and ask the system administrator to close out the alert associated with this event; then, stop executing this procedure.
- Otherwise, work with the system administrator to determine if someone took an action to bring down the supernode. Location information from the problem description may be the best method to communicate location to the system administrator.
- If the supernode-level network outage is explainable, close out the serviceable event and ask the system administrator to close out the alert.
- If the event cannot be explained, call your next level of support.

HFI_IFR

This indicates that a frame has a large number of link events associated with it in a short period of time. This is typically caused by user action or some sort of power or thermal event. Most such events are reported and taken into account, and a serviceable event or alert is not reported.
- Record the SRC or refcode.
- Record the problem description.
- Record the FRU list.
- Look for a serviceable event on the HMC that manages the frame. If there is one, close out the

serviceable event and ask the system administrator to close out the alert associated with this event, and stop executing this procedure.
- Otherwise, work with the system administrator to determine if someone took an action to bring down the frame (such as hitting the EPO switch). Location information from the problem description may be the best method to communicate location to the system administrator.
- If the frame-level network outage is explainable, close out the serviceable event and ask the system administrator to close out the alert.
- If the event cannot be explained, call your next level of support.

HFI_ICL

This indicates that a cluster has a large number of link events associated with it in a short period of time. This is typically caused by user action or some sort of power or thermal event. Most such events are reported and taken into account, and a serviceable event or alert is not reported.
- Record the SRC or refcode.
- Record the problem description.
- Record the FRU list.
- Work with the system administrator to determine if someone took an action to bring down such a large number of links, or if there was a problem with the power feed into the cluster.
- If the cluster-level network outage is explainable, close out the serviceable event and ask the system administrator to close out the alert.
- If the event cannot be explained, call your next level of support and provide the problem details recorded above.

HFI_UFC

This is a code problem where an unknown FRU Class was passed by the alert metadata to TEAL.
- Record the SRC or refcode.
- Record the problem description.
- Record the FRU list.
- Call your next level of support and provide the problem details recorded above.

HFI_COM

This indicates that a complex analysis has determined that a combination of events has a single root cause, but a common failing device has not been determined.
- Record the SRC or refcode.
- Record the problem description.
- Record the FRU list.
- Call your next level of support and provide the problem details recorded above.

HFILONG

This indicates that a bad parameter has been passed into TEAL to use for determining a FRU list. Perform the following procedure:
- Record the problem description or reason.
- Record the FRU list.
- Call your next level of support and provide them with the problem details recorded above.

HFI Symbolic Procedures:

HFI symbolic procedures are intended to describe the FRUs they represent. In some cases, more details are required to help identify missing information. The following is a list of the symbolic procedures; following the list are the details for each procedure:

HFI_CAB: This procedure represents one end of the cable. Typically, it is the end of the cable that is most likely to have failed, based on which side reported the problem.
CBLCONT: This procedure represents the other end of the cable.
HFI_OM: This procedure represents an optical module.
HFILONG: This is a code problem where a procedure name passed to TEAL is too long. The problem is in CNM_GEAR_alert_metadata.xml.
HFI_LRA: This procedure represents an LR-link cable assembly, which never has any FRU data associated with it.
HFI_VPT: This procedure represents a port when the port is unknown by the code.
HFI_VHM: This procedure represents the hub module when the hub module VPD is not accessible.
HFI_VPL: This procedure represents the planar when the planar VPD is not accessible.

HFI_CAB

Before working with this procedure, be sure to have run any isolation procedure listed before it in the FRU list. When a FRU is represented by HFI_CAB, perform the following:
- Record the location of the HFI_CAB FRU.
- Record the location of the CBLCONT FRU that follows the HFI_CAB FRU.
- Check both ends of the cable for obvious seating issues.
- Check both connectors to see if they are dirty. If so, clean them. CAUTION: Always use proper safety measures with optical cables and connectors, which have lasers.
- If the cables look well seated and clean, replace the cable; obtain a new cable that is long enough to reach between the two ports identified by HFI_CAB and CBLCONT. If the cable is from the original installation, it will most likely be part of a bundle. Do not attempt to disassemble the cable bundle or replace it.
- Use a single optical cable as a replacement, and bury or cut off the ends of the failed cable.

- After repairing the cable, check the health of the network to verify that the link is back up and operational and that no other problems were caused. Use the HFI network health check procedure documented in High performance clustering using the 9125-F2C Management Guide.

CBLCONT

CBLCONT represents the continuation of a cable definition. All procedures against this FRU location should be performed under the HFI_CAB procedure, above.

HFI_OM

This represents an optical module. The optical module is not replaceable. It is replaced using the procedure for replacing a hub module, which requires special training. Call your next level of support. Note: This FRU is primarily used to help in tracking A+ resource problems, particularly for LR-link ports.

HFILONG

This procedure indicates that an HFI procedure in the alert metadata used for analysis rules is too long for the part number field, which has a 7-character maximum. Perform the following procedure:
- Record the SRC or refcode.
- Record the problem description. In particular, make note of the HFILONG symbolic procedure.
- Call your next level of support and indicate that a code problem exists with a network event for a 9125-F2C system. In particular, indicate that the HFILONG symbolic procedure has been encountered.

HFI_LRA

This procedure indicates an LR-link cable assembly. The entire cable assembly must be replaced for all drawers within a supernode. Perform the following procedure:
- Record the SRC or refcode.
- Record the problem description.
- Record the FRU list.
- Before proceeding, inform the system administrator that a repair action will be impacting an entire supernode. Work with the system administrator to determine if maintenance should be deferred until another time.
- Obtain a replacement LR-link cable assembly.
- Detach the broken LR-link cable assembly from all drawers in the supernode. Be careful not to disturb any D-link cable connectors.
- Replace the LR-link cable assembly with the new one.
Run the HFI network health check to verify that the link is back up and operational after the repair action and that no other problems were caused. Use the HFI network health check procedure documented in High performance clustering using the 9125-F2C Management Guide.

HFI_VPT

The location for a cable connector attached to a port is not known. This may preclude you from running link diagnostics without the help of your next level of support.
1. Record the SRC or refcode.
2. Record the problem description.
3. Record the FRU list.
4. Call your next level of support and indicate that HFI_VPT is in the FRU list. The support team will work with the SSR and system administrator to isolate the link.

HFI_VHM

The VPD for a hub module is not known. Typically, you will not be at this FRU until you have performed an isolation procedure. Furthermore, the replacement of a hub module requires special training.
1. Record the SRC or refcode.
2. Record the problem description.
3. Record the FRU list.
4. Call your next level of support and indicate that HFI_VHM is in the FRU list. The support team will work with the system administrator to determine the proper hub module to use as a replacement.

HFI_VPL

The VPD for a system planar is not known. Typically, you will not be at this FRU until you have performed an isolation procedure. Furthermore, the replacement of a planar for network problems should be done under the direction of Product Engineering.
1. Record the SRC or refcode.
2. Record the problem description.
3. Record the FRU list.
4. Call your next level of support and indicate that HFI_VPL is in the FRU list. The support team will work with the system administrator to determine the proper planar to use as a replacement.

HFI Network Locations

There are several location formats used to describe HFI network locations:

Logical Hardware
  Where used: TEAL
  These describe hardware locations relative to a software understanding of the physical structure of the network. See HFI Network Logical Hardware Locations, on page 53.

Service/Physical Hardware
  Where used: Serviceable events on the HMC; TEAL raw_data:fru_list
  These describe hardware locations relative to the architected Power server location format. See HFI Network Service/Physical Hardware Locations, on page 54.

Application
  Where used: TEAL
  These describe application locations. In the

case of HFI, this refers particularly to CNM on the EMS. See HFI Network Application Locations, on page 56.

HFI Network Logical Hardware Locations

The HFI network logical hardware locations identify the components in a hierarchical manner, starting with the frame and going all the way down to a particular link port. Both CNM and TEAL use a similar format, with TEAL providing an extra layer of data. The formats are a combination of keywords followed by an instance number. Each keyword indicates the type of location. The formats are listed first, followed by the list of keywords, followed by some examples.

For TEAL, the format is one of:
- For links: FRxxx-CGxx-SNxxx-DRx-HBx-OMxx-L[DLR]x, where L[DLR] is either LD, LR or LL
- For HFI ramps: FRxxx-CGxx-SNxxx-DRx-HBx-HFx-RMx
- For hubs: FRxxx-CGxx-SNxxx-DRx-HBx

For ISNM, the format is: FRxxx-CGxx-SNxxx-DRx-HBx-L[DLR]x

Note: If the referenced component is at a higher level, only the number of levels required to locate that component will be in the location code. For example, a frame and cage may be all that is required to locate a CEC drawer; in that case, the location code will be of the form FRxxx-CGxx.

The location keywords are in the following table:

Keyword  Description
FR       Frame 0 through 999.
CG       CEC cage. This is the cage within the frame. It is sequential from the bottom to the top of the frame and is based on plugging of the power cables.
SN       Supernode 0 through 511.
DR       Drawer 0 through 3. This is the drawer within the supernode.
HB       Hub 0 through 7 within the cage or drawer.
OM       Optical module 0 through 27. This is on each particular hub. There is one optical module location per D-link and one for every two LR-links.
LD       D-link port 0 through 15. This is on each particular hub. NOTE: ISNM uses D to denote a D-link.
LR       LR-link port 0 through 23. This is on each particular hub.
LL       LL-link port 0 through 6. This is on each particular hub.
HF       HFI 0 through 1. This is on each particular hub.
RM       HFI ramp 0 through 3. This is within each HFI.
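A location string in this keyword format can be split mechanically. The following is an illustrative sketch (the function name is mine, not part of any product tooling), assuming a well-formed location code:

```python
import re

# The keywords from the table above, each followed by an instance number.
_KEYWORDS = ("FR", "CG", "SN", "DR", "HB", "OM", "LD", "LR", "LL", "HF", "RM")

def parse_logical_location(loc):
    """Split e.g. 'FR003-CG13-SN06-DR1-HB3-LR22' into {keyword: number}."""
    parts = {}
    for token in loc.split("-"):
        m = re.fullmatch(r"(%s)(\d+)" % "|".join(_KEYWORDS), token)
        if m is None:
            raise ValueError("unrecognized location token: " + token)
        parts[m.group(1)] = int(m.group(2))
    return parts
```

Because only the levels needed to locate a component are present, the same helper handles a short code such as FR005-CG13 and a full link location.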

Examples of logical hardware locations are:

FR003-CG13-SN06-DR1-HB3-OM5-LD05    # in TEAL
FR003-CG13-SN06-DR1-HB3-OM25-LR18   # in TEAL
FR003-CG13-SN06-DR1-HB3-LR22        # in ISNM
FR003-CG13-SN06-DR1-HB3-HF1-RM2     # in TEAL

For mapping between HFI network logical hardware locations and service/physical locations, see Mapping between logical hardware and service/physical locations, on page 56.

HFI Network Service/Physical Hardware Locations

The HFI network service/physical hardware locations are compliant with the Power service physical locations. They consist of one of the following:

[unit location]-[planar]-[hub resource]-[optical module resource]
[unit location]-[planar]-[connector location]

The unit location is defined as the following for 9125-F2C systems: U78A9.001.[cage serial number], where the serial number is 7 digits. The unit location is on a label on the system.

The planar associated with HFI network events in a 9125-F2C system is always P1.

The hub and optical module resource keyword is R. There are 8 hub locations (R1 through R8) and 40 optical module resource locations (R1 through R40). See the following diagram for a map of the physical location of the hubs.

See the following diagram for a map of the optical modules on a hub.

The connector locations have two levels, Ty-Tx, where Ty indicates rows of connectors (T1-T17) and Tx indicates columns of connectors (T1-T8). The LR-links are all associated with T9; therefore, there is no Tx for LR-links. D-links use both Ty and Tx. The following is a diagram of the connectors. Note that T1-T1 is in the upper left and T17-T8 is in the bottom right.

You may note how each hub's D-links occupy 2 columns in each row.

Examples of service/physical locations:

U*-P1-T3-T4    # D-link connector
U*-P1-T9       # LR-link connector
U*-P1-R1       # hub location
U*-P1-R3-R5    # optical module location for a D-link
U*-P1-R5-R27   # optical module for an LR-link

For mapping between HFI network logical hardware locations and service/physical locations, see Mapping between logical hardware and service/physical locations, on page 56.

HFI Network Application Locations

When CNM is a significant reporting location or might be the root cause of an alert or serviceable event, it is reported to TEAL using an application location. The alert event_loc_type, event src_loc_type, or event report_loc_type will be A for an application location. The format for a CNM application location is: [EMS hostname]##cnmd##[cnmd pid]. For example: myemsa##cnmd##[cnmd pid].

Mapping between logical hardware and service/physical locations

In order to map between the logical hardware and service/physical locations, several steps are required. The most difficult is to map between the frame, cage, supernode and node location and the unit location of the

service/physical location. This requires either a written map or a lookup of data in the cluster database. However, quite often it is not necessary to do this. Instead, if you are looking at a service/physical location in a FRU list, you probably have frame and cage, and supernode and drawer information from either the event description in Service Focal Point or the reason in TEAL. When servicing a CEC drawer location and using the frame and cage to find it, you should double-check the unit location information on the label on the CEC drawer. An example message is:

D Link Port Lane Width Change between frame FR008 cage CG04 (supernode SN000 drawer DR1) hub HB7 port LD11 and frame FR008 cage CG11 (supernode SN002 drawer DR0) hub HB7 port LD14 (D Link Port Lane Width Change)

If you are looking at a logical hardware location, the frame and cage will get you to the CEC drawer on the floor. To find the MTMS (for double-checking the location), use the following to search for it. The example is for frame 5, cage 13. (Recall that cage numbering for 9125-F2C servers starts at 3 and ends at 14.)

lsdef cec -i frame,id,mtm,serial -c | awk '{ if ( $1 != a ) { printf "\n"$1" "$2" " } else { printf $2" " } a=$1 } END{print}' | grep "frame=5.*id=13"

All HFI network locations within a CEC drawer are on planar P1. The next level to map to is the hub level. This is simply a matter of mapping the logical hardware's HBx to Rx in the service location. To go from HBx to Rx, add one to HBx. For example, HB2 becomes R3. To map from Rx to HBx, subtract 1 from Rx. For example, R5 becomes HB4. If there is an optical module defined (OMxx), then use the following table to map to an optical module service location. Note that the RX in the optical module service location is the hub location.
Table 2: Port Logical Hardware Location to Optical Module Service Location

Port Hardware   Optical Module      Optical Module
Location        Hardware Location   Service Location
OM0-LD0         OM0                 RX-R28
OM1-LD1         OM1                 RX-R15
OM2-LD2         OM2                 RX-R7
OM3-LD3         OM3                 RX-R14
OM4-LD4         OM4                 RX-R6
OM5-LD5         OM5                 RX-R13
OM6-LD6         OM6                 RX-R5
OM7-LD7         OM7                 RX-R12
OM8-LD8         OM8                 RX-R4
OM9-LD9         OM9                 RX-R11
OM10-LD10       OM10                RX-R3
OM11-LD11       OM11                RX-R10
OM12-LD12       OM12                RX-R2
OM13-LD13       OM13                RX-R9
OM14-LD14       OM14                RX-R1
OM15-LD15       OM15                RX-R8
OM16-LR0        OM16                RX-R26

OM16-LR1        OM16                RX-R27
OM17-LR2        OM17                RX-R39
OM17-LR3        OM17                RX-R40
OM18-LR4        OM18                RX-R24
OM18-LR5        OM18                RX-R25
OM19-LR6        OM19                RX-R37
OM19-LR7        OM19                RX-R38
OM20-LR8        OM20                RX-R22
OM20-LR9        OM20                RX-R23
OM21-LR10       OM21                RX-R35
OM21-LR11       OM21                RX-R36
OM22-LR12       OM22                RX-R20
OM22-LR13       OM22                RX-R21
OM23-LR14       OM23                RX-R33
OM23-LR15       OM23                RX-R34
OM24-LR16       OM24                RX-R18
OM24-LR17       OM24                RX-R19
OM25-LR18       OM25                RX-R31
OM25-LR19       OM25                RX-R32
OM26-LR20       OM26                RX-R16
OM26-LR21       OM26                RX-R17
OM27-LR22       OM27                RX-R29
OM27-LR23       OM27                RX-R30

The link location along with the hub location will allow you to map to a link port connector on the bulkhead of the CEC drawer:

- If you have a D-link hardware location (LD), use the following table by cross-referencing the LD in a row and the hub in the column. For example, LD4 on HB5 is service/physical location P1-T3-T5.
- If you have an LR-link location (LR), then the service/physical location is P1-T9, because LR-links are part of one cable assembly.
- If you have an LL-link location (LL), then the service/physical location is the hub alone, because there is no defined service/physical location for an LL-link.

Table 3: D-link hardware location to service location

D-link (LD)  HB0        HB1        HB2        HB3        HB4       HB5       HB6       HB7
0            P1-T10-T7  P1-T10-T5  P1-T10-T3  P1-T10-T1  P1-T1-T7  P1-T1-T5  P1-T1-T3  P1-T1-T1
1            P1-T10-T8  P1-T10-T6  P1-T10-T4  P1-T10-T2  P1-T1-T8  P1-T1-T6  P1-T1-T4  P1-T1-T2
2            P1-T11-T7  P1-T11-T5  P1-T11-T3  P1-T11-T1  P1-T2-T7  P1-T2-T5  P1-T2-T3  P1-T2-T1
3            P1-T11-T8  P1-T11-T6  P1-T11-T4  P1-T11-T2  P1-T2-T8  P1-T2-T6  P1-T2-T4  P1-T2-T2
4            P1-T12-T7  P1-T12-T5  P1-T12-T3  P1-T12-T1  P1-T3-T7  P1-T3-T5  P1-T3-T3  P1-T3-T1
5            P1-T12-T8  P1-T12-T6  P1-T12-T4  P1-T12-T2  P1-T3-T8  P1-T3-T6  P1-T3-T4  P1-T3-T2
6            P1-T13-T7  P1-T13-T5  P1-T13-T3  P1-T13-T1  P1-T4-T7  P1-T4-T5  P1-T4-T3  P1-T4-T1

7            P1-T13-T8  P1-T13-T6  P1-T13-T4  P1-T13-T2  P1-T4-T8  P1-T4-T6  P1-T4-T4  P1-T4-T2
8            P1-T14-T7  P1-T14-T5  P1-T14-T3  P1-T14-T1  P1-T5-T7  P1-T5-T5  P1-T5-T3  P1-T5-T1
9            P1-T14-T8  P1-T14-T6  P1-T14-T4  P1-T14-T2  P1-T5-T8  P1-T5-T6  P1-T5-T4  P1-T5-T2
10           P1-T15-T7  P1-T15-T5  P1-T15-T3  P1-T15-T1  P1-T6-T7  P1-T6-T5  P1-T6-T3  P1-T6-T1
11           P1-T15-T8  P1-T15-T6  P1-T15-T4  P1-T15-T2  P1-T6-T8  P1-T6-T6  P1-T6-T4  P1-T6-T2
12           P1-T16-T7  P1-T16-T5  P1-T16-T3  P1-T16-T1  P1-T7-T7  P1-T7-T5  P1-T7-T3  P1-T7-T1
13           P1-T16-T8  P1-T16-T6  P1-T16-T4  P1-T16-T2  P1-T7-T8  P1-T7-T6  P1-T7-T4  P1-T7-T2
14           P1-T17-T7  P1-T17-T5  P1-T17-T3  P1-T17-T1  P1-T8-T7  P1-T8-T5  P1-T8-T3  P1-T8-T1
15           P1-T17-T8  P1-T17-T6  P1-T17-T4  P1-T17-T2  P1-T8-T8  P1-T8-T6  P1-T8-T4  P1-T8-T2

The HFI (HFx) and ramp (RMx) locations are within the hub and have no corresponding service/physical location other than the hub within which they are populated.

Example translations:

FR001-CG05-SN001-DR0-HB5-LD14
Frame 1, cage 5, supernode 1, drawer 0, hub 5, D-link 14.
- The CEC drawer can be found in frame 1, cage 5.
- The hub location is: U*-P1-R6 (HB5 + 1 = R6).
- The optical module location is: U*-P1-R6-R1 (lookup Table 2: Port Logical Hardware Location to Optical Module Service Location, on page 57, using LD14).
- The link port connector is: U*-P1-T8-T5 (lookup Table 3: D-link hardware location to service location, on page 58, using HB5 and LD14).

FR003-CG09-SN003-DR2-HB3-LR15
Frame 3, cage 9, supernode 3, drawer 2, hub 3, LR-link 15.
- The CEC drawer can be found in frame 3, cage 9.
- The hub location is: U*-P1-R4 (HB3 + 1 = R4).
- The optical module location is: U*-P1-R4-R34 (lookup Table 2: Port Logical Hardware Location to Optical Module Service Location, on page 57, using LR15).
- The link port connector is: U*-P1-T9 (all LR-links are on T9).

U*-P1-T8-T7
Unit location U*; planar P1; D-link connector T8-T7.

- Find the frame and cage in the message in either the serviceable event description or the TEAL reason. For this example, assume it is frame 4, cage 11, supernode 5, drawer 0.
- The D-link is LD14 and the hub is HB4 (lookup Table 3: D-link hardware location to service location, on page 58, finding T8-T7 in row 14, column HB4).
- The hub resource is: R5 (HB4 + 1 = R5).
- The optical module is: OM14 (lookup Table 2: Port Logical Hardware Location to Optical Module Service Location, on page 57, using LD14).
- The complete logical hardware location is: FR004-CG11-SN005-DR0-HB4-OM14-LD14.

U*-P1-R3-R6
Unit location U*; planar P1; hub resource R3; optical module resource R6.
- Find the frame and cage in the message in either the serviceable event description or the TEAL reason. For this example, assume it is frame 4, cage 11, supernode 5, drawer 0.
- The hub is: HB2 (R3 - 1 = HB2).
- The optical module is: OM4 (lookup Table 2: Port Logical Hardware Location to Optical Module Service Location, on page 57, using R6).
- The D-link is: LD4 (the same Table 2 row, using R6).
- The logical hardware location of the optical module is: FR004-CG11-SN005-DR0-HB2-OM4.
- The complete logical hardware location of the link port is: FR004-CG11-SN005-DR0-HB2-OM4-LD4.
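The hub offset and the two tables above are regular enough to express in a few lines. The following is an illustrative sketch (the helper names are mine): the lists transcribe Table 2, and the connector arithmetic encodes the row/column pattern visible in Table 3.

```python
# Service R-number of each port's optical module, transcribed from Table 2:
# index = LD number (D-links, on OM0-OM15) or LR number (on OM16-OM27).
LD_OM_R = [28, 15, 7, 14, 6, 13, 5, 12, 4, 11, 3, 10, 2, 9, 1, 8]
LR_OM_R = [26, 27, 39, 40, 24, 25, 37, 38, 22, 23, 35, 36,
           20, 21, 33, 34, 18, 19, 31, 32, 16, 17, 29, 30]

def hub_service(hb):
    """HBx (0-7) -> hub resource location (R1-R8): add one."""
    return "P1-R%d" % (hb + 1)

def om_service(hb, kind, port):
    """Optical module service location for LDx or LRx on HBx (Table 2)."""
    r = LD_OM_R[port] if kind == "LD" else LR_OM_R[port]
    return "%s-R%d" % (hub_service(hb), r)

def dlink_connector(hb, ld):
    """D-link connector location (Table 3): rows T10-T17 serve HB0-HB3,
    rows T1-T8 serve HB4-HB7; each hub owns two adjacent columns."""
    ty = (10 if hb < 4 else 1) + ld // 2   # connector row
    tx = 7 - 2 * (hb % 4) + ld % 2         # connector column
    return "P1-T%d-T%d" % (ty, tx)

def om_r_to_port(r):
    """Invert Table 2: service R-number -> (OMxx, port), as in example 4."""
    if r in LD_OM_R:
        ld = LD_OM_R.index(r)
        return "OM%d" % ld, "LD%d" % ld
    lr = LR_OM_R.index(r)
    return "OM%d" % (16 + lr // 2), "LR%d" % lr
```

For the first example above, om_service(5, "LD", 14) gives P1-R6-R1 and dlink_connector(5, 14) gives P1-T8-T5; om_r_to_port(6) recovers (OM4, LD4) as in the last example.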

Power 775 Availability Plus actions

When a hardware serviceable event is reported, it will include a FRU list. If an IBM Power 775 Availability Plus (A+) resource is reported in the FRU list, then the A+ procedures must be performed. Many of the tasks performed by the administrator will use xCAT commands to reassign resources and help track them. The xCAT document on Cluster Recovery covers these details, as well as other xCAT information related to A+.

An A+ resource is identified when:

- The FRU location has an R identifier (for example, U787C-P1-R1). Note: The location identifier U787C is often noted in examples using the shorthand of U*. Also, the following ranges of locations are defined using regular expressions:
  o U*-P1-R[1-8] are HFI hub modules. For example: U*-P1-R7.
  o U*-P1-R[1-8]-R[1-40] are optical module port locations. Note that they are contained on the hub modules U*-P1-R[1-8]. For example: U*-P1-R3-R5.
  o U*-P1-R[9-40] are processor chip locations. For example: U*-P1-R10. These are in groups of four per QCM. For example, U*-P1-R[13-16] are all on the same QCM, which is in the second octant.
- The FRU location is an HFI port location:
  o U*-P1-T[1-8]-T* and U*-P1-T[10-17]-T* are D-link port connector locations on the bulkhead.
  o U*-P1-T9 is the LR-link port connector location on the bulkhead.

Note: More information on hardware service/physical locations can be found in Hardware Service Locations for 9125-F2C, on page 75.
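The identification rules above amount to a handful of regular expressions. A minimal sketch follows (the function name is mine; it assumes the leading U* unit location has already been stripped from the FRU location):

```python
import re

# One pattern per A+ resource class listed above.
_APLUS_PATTERNS = (
    r"P1-R[1-8]",                         # HFI hub module
    r"P1-R[1-8]-R([1-9]|[1-3][0-9]|40)",  # optical module on a hub
    r"P1-R(9|[1-3][0-9]|40)",             # processor chip (QCM member)
    r"P1-T[1-8]-T[1-8]",                  # D-link connector, rows T1-T8
    r"P1-T1[0-7]-T[1-8]",                 # D-link connector, rows T10-T17
    r"P1-T9",                             # LR-link connector
)

def is_aplus_resource(fru_loc):
    """True if a unit-stripped FRU location names an A+ resource."""
    return any(re.fullmatch(p, fru_loc) for p in _APLUS_PATTERNS)
```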
When an A+ resource is identified in the FRU list, use the A+ recovery procedure.

Availability Plus recovery procedure

Use this procedure when you have identified a serviceable event that contains an A+ resource in the FRU list. You will be required to do the following tasks:

- Determine the resource that has failed.
- Determine if this is a repeat failure of the same resource.
- Determine the type of node (if any) it impacts.
- Perform the appropriate recovery actions based on the above and the local spare policy.
- Report the problem to IBM.
- Gather data for IBM to determine if there are any repairs necessary at this time, using /usr/xcat/sbin/gatherfip.
- Record the failed resource using the following table, which is also documented in High performance clustering using the 9125-F2C Management Guide under Power 775 Availability Plus Management.

Table 4: IBM Power 775 Availability Plus Failed Resource Record

Resource hostname   Location   Spare Policy   Date deployed

                    (Frame/Slot; SuperNode/Drawer)   (hot/cold/warm)

Please perform the following:

1. Determine the resource that has failed by examining the FRU list of the serviceable event and cross-referencing the FRU locations in the table below.

Table 5: FRU location to Availability Plus resource

FRU locations: Resource
- U*-P1-R[9-40]: QCM
- U*-P1-R[1-8] and no other FRU: Hub module
- U*-P1-R[1-8] and (U*-P1-R[1-8]-R[1-15] or U*-P1-R[1-8]-R28) and U*-P1-T*-T*: D-link
- U*-P1-R[1-8] and (U*-P1-R[1-8]-R[16-27] or U*-P1-R[1-8]-R[29-40]) and U*-P1-T9: LR-link
- U*-P1-R[1-8]-R[1-15] or U*-P1-R[1-8]-R28, and no other FRU: D-link optical module
- U*-P1-R[1-8]-R[16-27] or U*-P1-R[1-8]-R[29-40], and no other FRU: LR-link optical module
- Two separate locations with the same unit location (U*) and of the format U*-P1-R[1-8]: LL-link failure. This indicates a failure in an interface between two hubs.

2. Reference a record of previous failures. If the FRU locations in this failure match those of a previous failure, you should have already recovered from this failure, and you can exit this procedure. For a method to record failures, see Table 4: IBM Power 775 Availability Plus Failed Resource Record, on page 61.

3. If the only FRU location in the FRU list is U*-P1-T9 and the serviceable event message indicates a problem with an LR-link cable assembly, call IBM immediately and open a PMH.

4. Determine the location of the failure:
- Record the unit location from the FRU location. This is the U* number. It will be of the format: U787C
- On the EMS, use xCAT to cross-reference the unit location to the frame and drawer.

  Use the procedure Using xCAT to cross-reference unit locations.
- Determine the octant for the QCM using the table below:

Table 6: QCM to Octant Map

QCM location      Octant
U*-P1-R[9-12]     0
U*-P1-R[13-16]    1
U*-P1-R[17-20]    2
U*-P1-R[21-24]    3
U*-P1-R[25-28]    4
U*-P1-R[29-32]    5
U*-P1-R[33-36]    6
U*-P1-R[37-40]    7
U*-P1-R1*         0
U*-P1-R2*         1
U*-P1-R3*         2
U*-P1-R4*         3
U*-P1-R5*         4
U*-P1-R6*         5
U*-P1-R7*         6
U*-P1-R8*         7

Note: If an optical module location is given (U*-P1-R[1-8]-R[1-40]), the octant is determined using just the hub module portion of the location: U*-P1-R[1-8].

Note: Typically, when a link port connector location is in a FRU list, there will also be an optical module location. Use the hub module portion of the optical module location to determine the associated octant. The exception is an LR-link cable assembly problem, which will only include the LR-link cable assembly location U*-P1-T9. This will impact more than one octant, and you should call IBM immediately.

5. If this is a QCM failure, determine the type of node that is impacted. At this point, all you need to determine is whether this is a Compute node or a non-compute node.

6. Refer to the cluster configuration and determine if the node in the frame, drawer and octant found above is a compute node or a non-compute node.
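Table 6 can be collapsed into arithmetic on the first R-number after the planar. An illustrative sketch (the function name is mine; per the notes above, an optical module location resolves via its hub portion):

```python
import re

def octant_from_location(loc):
    """Octant per Table 6: hub modules R1-R8 map to octants 0-7, and QCM
    chips R9-R40 map in groups of four (R9-R12 -> 0, R13-R16 -> 1, ...)."""
    m = re.search(r"-R(\d+)", loc)   # first R-number, i.e. the hub or chip
    if m is None:
        raise ValueError("no R identifier in: " + loc)
    r = int(m.group(1))
    if 1 <= r <= 8:
        return r - 1
    if 9 <= r <= 40:
        return (r - 9) // 4
    raise ValueError("R-number out of range: %d" % r)
```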

7. Perform the appropriate recovery procedure by cross-referencing the information in the table below. For many failed resources, the recovery procedure is the same regardless of node type. After performing the recovery procedure, return to this procedure and continue on.

Table 7: Failed Resource to A+ recovery procedure

- QCM: for a Compute node, Availability Plus: Recovering a Compute Node, on page 64; for a non-compute node, Availability Plus: Recovering a non-compute node, on page 65.
- Hub and QCM (with SRC of format B114RRRR): Availability Plus: Recovering a hub module, on page 70.
- Hub module: Availability Plus: Recovering a hub module, on page 70.
- D-link: Availability Plus: Recovering a D-link, on page 72.
- LR-link: Availability Plus: Recovering an LR-link, on page 73.
- D-link optical module: Availability Plus: Recovering a D-link, on page 72.
- LR-link optical module: Availability Plus: Recovering an LR-link, on page 73.
- LL-link failure: The network connectivity for the entire drawer is compromised. Use the procedure in Availability Plus: Recovering a failure that impacts a drawer.

8. Report the problem to IBM and open a PMH.

9. Gather data for IBM to determine if there are any repairs necessary at this time. Use the procedure in Gathering data for Availability Plus resource failures, on page 75, then return here.

10. Record this failure's FRU location(s), physical location(s), resource type and node type (if applicable).

11. If IBM dispatches personnel for repair, record the repaired resources to be referenced for future failure recovery. Use Table 4: IBM Power 775 Availability Plus Failed Resource Record, on page 61.

12. Verify that everything is in the proper state using Verifying state after Availability Plus failure recovery, then return here.

13.
This procedure ends here.

Availability Plus: Recovering a Compute Node

Perform this procedure if you have determined that an A+ resource has failed in a Compute node. You should have been sent here by another procedure.

1. Policies: Determine your local spare policy and local job management policy for A+.
- The local spare policy can be Hot spares, Cold spares or Warm spares.
- The local job management policy can include a pool for A+ resources or a pool for failed resources.

Other policies may exist, but their effects on the following procedure must be documented locally.

2. Bring spare on-line: Perform the necessary steps to bring the spare resource on-line based on your local spare policy. After that, proceed to the step for job management.
- If the local spare policy is for Hot spares, you do not need to do anything to bring the spare A+ resource on-line, because it is already on-line and usable. Go to the step labeled Job management.
- If the local spare policy is for Cold spares, you will need to bring the spare resource on-line from the powered-off state.
  a) Drain the job queue for the rest of the nodes in the drawer with the failed Compute node and make sure that no jobs can start on them until the process is finished.
  b) Re-IPL the drawer with the failed Compute node using xCAT's rpower command on the EMS. Go to the step labeled Job management.
- If the local spare policy is for Warm spares, you will need to bring the spare resource on-line from the partition standby state.

3. Boot the partition for the spare Compute node using the xCAT rpower command on the EMS.

4. Go to the step labeled Job management.

5. Job management: Perform the steps necessary for job management with respect to the spare resource and failed resource:
- If the local job management policy has defined a job pool for spare resources, determine an available spare Compute node and move it into the job pool that is used for Workload Resources. You should have created a tracking mechanism for spare resources in the cluster like the one described in Power 775 Availability Plus Management in the High performance clustering using the 9125-F2C Management Guide.
- If the local job management policy has defined a job pool for failed resources, move the failed Compute node into that pool.

6. Failed Resource State: If the local policy is to not use failed resources, power off the QCM containing the failed resource. Use xCAT rpower on the EMS:
rpower [nodename] off

Record the failed octant in the Aplus_defective group:

chdef -t group -o Aplus_defective members=[node]

7. This procedure ends here. Return to the procedure that pointed to this procedure.

Availability Plus: Recovering a non-compute node

Perform this procedure if you have determined that an A+ resource has failed in a non-compute node. You should have been sent here by another procedure.

Important terms: The following terms will be used during this procedure.

non-compute node: The node that has failed.
target compute node: The node with which the failed non-compute node will be swapped.
spare compute node: The node that will assume the workload of the target compute node. This is necessary if the target node is currently being used to fulfill the workload resources.

Summary

You will be performing the following tasks in the procedure:
1. Determine local sparing and job management policies for A+.
2. Identify a Compute node to replace the failed non-compute node. This is the target compute node.
3. If necessary, identify and bring on-line a spare Compute node to replace the Compute node used to replace the non-compute node, because you are going to be forcing a Compute node into a failed state when you swap it with the non-compute node. For example, if node A is the failed node, and node B is the node that will be taking its place, you may require a node C to be used as a spare for node B. This is the spare compute node.
4. The above step will also take care of the state of the failed node resource.
5. Swap the Compute node and the non-compute node.

The detailed procedure follows:

1. Policies: Determine your local spare policy and local job management policy for A+. The local spare policy (Hot spares, Cold spares or Warm spares) will drive different recovery actions. The local job management policy can include a pool for A+ resources or a pool for failed resources. Other policies may exist, but their effects on the following procedure must be documented locally.

2. Determine the type of non-compute node: The type of non-compute node will be important in understanding the partitioning that will be required on the target node. It can also drive other recovery actions.

Table 8: non-compute node configuration and A+ recovery

Service Node: A service node is critical to maintaining operation on the nodes that it services. It will have a disk and Ethernet adapter assigned to it. Before proceeding, use the service node recovery procedure in the xCAT document on Cluster Recovery.

GPFS Storage Node: The GPFS Storage node will have SAS adapters assigned to it. If the GPFS Storage node is still operational, before proceeding ensure that there is an operational backup node or nodes for the set of disks that it owns. If there is a backup, you can proceed with this procedure. If there is no backup, then you must first recover or repair the backup before proceeding, or the filesystem will come down during this procedure.

Login Node: The Login node will have an Ethernet adapter assigned to it. If the Login node is still operational, before proceeding ensure that there is another operational Login node or nodes. If there are other operational Login nodes, you can proceed with this procedure. If there is no other operational Login node, consider first recovering or repairing another Login node before proceeding, or users will lose access to the cluster during this procedure.

Other: Other non-compute nodes probably have adapters assigned to them. If this node provides a critical function to the cluster, and it is still operational, you will want to check that a backup node is available to take over its function during this procedure, because the node will be rendered inoperative for a period of time while it is moved to another hardware resource.

3. Drawer Resource: Determine if you have a Compute node resource in the drawer.

c) Based on the information gathered to determine the frame and slot location of the drawer, look up any functional and failed resources in that drawer.
d) Determine with which Compute node you will swap the failed non-compute node so that you can restore the non-compute node function; this is the target compute node. Use the table below. The table is in order of preference. Cross-reference the target compute node location and target compute node state and perform the action in the last column. Bear in mind that part of the swap process requires rebooting the logical partitions that contain the nodes that will be swapped.

Table 9: Compute Node A+ actions based on state

Compute Node location: In drawer with failed node
  Hot spare: Drain jobs from the compute node and prevent new jobs from starting on it.

  Warm spare: Boot the partition and prevent new jobs from starting on the compute node.
  Cold spare: No action required on this compute node.
  Workload Resource: Drain jobs from the compute node and prevent new jobs from starting on it.

Compute Node location: In backup drawer
  Hot spare: Drain jobs from the compute node and prevent new jobs from starting on it.
  Warm spare: Boot the partition and prevent new jobs from starting on the compute node.
  Cold spare: No action required.
  Workload Resource: Drain jobs from the compute node and prevent new jobs from starting on it.

e) Go to the step labeled Spare Compute node.

4. Spare Compute node:

f) Select a spare node to replace the target compute node that is going to replace the failed non-compute node. Use nodels or lsdef to determine if a spare node is in the CEC with the non-compute node. For more information, see the xCAT document on Cluster Recovery.

g) Perform the procedure in Availability Plus: Recovering a Compute Node, on page 64. Treat the chosen spare node in the procedure as a spare Compute node, and the target compute node as the failed compute node. After completing the referenced procedure, return to the next step of this procedure, Prepare to swap nodes.

5. Prepare to swap nodes:

h) Determine if the octant containing the target compute node and the octant containing the non-compute node which will be swapped are partitioned in the same manner. If they are partitioned the same, go to the next step, labeled Swap nodes. If they are not partitioned the same, perform the following procedure before proceeding to Swap nodes. An example of nodes that are not partitioned the same is a Compute node that occupies an entire octant and a Service node that shares an octant with a non-GPFS IO node. Another example would be a GPFS node that has SAS adapters assigned to it and a compute node with no SAS adapters assigned to it.

For more information on the following, reference the xCAT document on Cluster Recovery. Before proceeding, consider the following with respect to the different types of non-compute nodes.

i. Use lsvm [cec] to determine the partitioning in the CEC drawer.
ii. Use lsdef to list the definition for the non-compute node partition.
iii. Use lsdef to list the definition for the compute node partition.
iv. Use lsvm [non-compute partition] to determine the partition definition for the non-compute node partition. Save the results to be used later in the chvm command.
v. Use lsvm [compute partition] to determine the partition definition for the compute node partition.
vi. Determine whether the target compute node is already partitioned as required.
vii. Determine how to partition the compute node as the non-compute node.
viii. Determine how to partition the non-compute node as the compute node.
ix. Partition the target compute node in the same manner as the non-compute node, using chvm. Use the saved results of the lsvm command above by editing them and then pointing to them in the chvm command. For more information on doing this, see the xCAT document on Cluster Recovery.

i) If you do not have an operational Compute node in the drawer, you will swap the non-compute node function to a backup drawer. Identify the backup drawer for this non-compute node. This should have been identified in the planning and installation phases for this cluster. Then, do the following:

x. Contact IBM to move the adapters and cables in the drawer with the failed non-compute resource to the backup drawer for the failed non-compute resource.
xi. Drain all jobs from both drawers and prevent any jobs from starting on them.
xii. Power off both drawers. Another non-compute node that acts as a backup to the failed non-compute node should assume the function of the failed node during this time.
xiii. The IBM SSR will move the adapters and cables from one drawer to the other.
xiv.
Use rmhwconn to take the non-compute node out of the management domain for hardware server. xv. Go to Swap nodes 6. Swap nodes: If multiple non-compute nodes shared the same failing resource, then you will have to swap both nodes to the location of the Compute node. Depending on the configuration of the failing non-compute resources and the spare non-compute nodes, there may be several stesand compute nod Run the xcat command to swap the compute and non-compute node. You must run this for each non- Compute node in the failed octant. p. 69 of 119

swapnodes -c [current node] -f [target node]

[current node] = the non-compute node
[target node] = the Compute node that will assume the non-compute node function

Note: If the non-compute octant that failed was shared by more than one non-compute node, the above command must be run for all LPARs. For example, if compute2 were the target compute node, and sn1 and util1 were the nodes that occupied the failed non-compute octant, the commands would be:

swapnodes -c sn1 -f compute2
swapnodes -c util1 -f tmpnode

7. The boot order may change after IO re-assignment, so run rbootseq to set the boot string for the current_node.
8. If a swapped non-compute node is a service node, perform the following procedure:
j) If it is a stateful (diskful) service node, perform the following:
i. Reinstall the service node.
ii. If the compute nodes managed by the service node are stateless/statelite nodes, reboot the compute nodes managed by the service node.
k) If the service node is a Linux diskless service node, perform the following:
i. Boot the service node.
ii. Use mkhwconn to create all the connections between the hdwr_svr and the CECs that are controlled by this service node.
iii. Reboot the compute nodes managed by the service node.
9. If a swapped non-compute node is used for any purpose other than a service node, boot it.
10. If the non-compute node was part of a partitioned octant, there will now be a compute node and a temporary node occupying that octant. Remove the temporary node and give the compute node the entire octant: use rmvm to remove the temporary node and use chvm to give the compute node the entire octant. For more information on these commands, see the xCAT document on Cluster Recovery.
11. Register the failed non-compute node as a failed A+ resource, using:
chdef -t group -o Aplus_defective members=[node]
12. This procedure ends here.
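The per-LPAR swap commands in step 6 can be assembled mechanically. The following sketch is a hypothetical helper, not an xCAT tool; the node names mirror the example above, and it only prints the swapnodes commands for review rather than executing them:

```shell
# Hypothetical helper: print one swapnodes command per non-compute LPAR that
# occupied the failed octant. The first LPAR swaps with the target Compute
# node; any additional LPARs sharing the octant swap with the temporary node.
build_swap_cmds() {
    target_compute=$1; shift   # Compute node assuming the non-compute function
    tmpnode=$1; shift          # temporary node for additional shared LPARs
    first=1
    for lpar in "$@"; do
        if [ "$first" -eq 1 ]; then
            echo "swapnodes -c $lpar -f $target_compute"
            first=0
        else
            echo "swapnodes -c $lpar -f $tmpnode"
        fi
    done
}
```

Running `build_swap_cmds compute2 tmpnode sn1 util1` reproduces the two example commands above, which can then be reviewed before being run by hand.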
Return to the procedure that pointed to this procedure.

Availability Plus: Recovering a hub module

Perform this procedure if you have determined that a hub module has failed. You should have been sent here by another procedure.
1. Determine where in the hub the failure occurred and perform the action listed in the table below:

Table 10: Availability Plus Hub module symptom to action

Symptom: Serviceable Event with one or more PCI locations (U*-P1-C1 through U*-P1-C17), and replacing adapters does not repair the problem.
Failure: PCI bus failure.
Action: A PCI slot is compromised. Perform Availability Plus: Recovering a PCIe Slot, on page 74.

Symptom: B135RRRR with a Torrent location (CAU). Impacts just an octant; the action is similar to an NMMU failure.
Failure: CAU failure.
Action: The performance enhancing features are compromised for this octant. If a compute node is in this octant, perform Availability Plus: Recovering a Compute Node, on page 64. If a non-compute node is in this octant, perform Availability Plus: Recovering a non-compute node, on page 65.

Symptom: B135RRRR with a Torrent location (ISR). Impacts an entire drawer, similar to the default "Other" symptom below; different from CAU and NMMU failures, which impact only an octant.
Failure: ISR failure.
Action: The network connectivity for the entire drawer is compromised. Use the procedure in Availability Plus: Recovering a failure that impacts a drawer, on page 73.

Symptom: B135RRRR with a Torrent location (NMMU). Impacts just an octant; the action is similar to a CAU failure.
Failure: NMMU failure.
Action: The QCM in this octant can no longer reach the network. If a compute node is in this octant, perform Availability Plus: Recovering a Compute Node, on page 64. If a non-compute node is in this octant, perform Availability Plus: Recovering a non-compute node, on page 65.

Symptom: B114RRRR with P7 and Torrent in the FRU list.
Failure: W-bus failure.
Action: The QCM in this octant can no longer reach the network. If a compute node is in this octant, perform Availability Plus: Recovering a Compute Node, on page 64. If a non-compute node is in this octant, perform Availability Plus: Recovering a non-compute node, on page 65.

Symptom: B114RRRR with P7 and Torrent in the FRU list.
Failure: Power Bus failure.
Action: The QCM in this octant can no longer reach the network. Also, any PCIe resources accessed via this hub are compromised. If a compute node is in this octant, perform Availability Plus: Recovering a Compute Node, on page 64. If a non-compute node is in this octant, perform Availability Plus: Recovering a non-compute node, on page 65. Keep in mind that if there are not enough PCIe resources left in the node to support the function of the non-compute node(s), you must move the non-compute node(s) to the backup drawer. In that case, in the Drawer resource step, choose the node location In backup drawer.

Symptom: B135RRRR that does not match one of the above. Assumed to impact an entire drawer; the action is similar to an ISR failure.
Failure: Other failure.
Action: The network connectivity for the entire drawer is compromised. Use the procedure in Availability Plus: Recovering a failure that impacts a drawer, on page 73.

2. This procedure ends here. Return to the procedure that pointed to this procedure.

Availability Plus: Recovering a D-link

Perform this procedure if you have determined that a D-link has failed. You should have been sent here by another procedure.
1. Determine how many D-link failures are associated with this drawer.
a. Run the following command from the EMS and grep for the unit location of the failure that sent you to this procedure:
/opt/xcat/bin/xdsh hmc -l hscroot /usr/bin/lsrsrc-api -s IBM.ServiceEvent::\"Status=\'Open\'\"::HSCId::HSCName::Level::Nodes\" | grep [unit location]
2. If there are fewer than 3 D-link failures associated with this drawer, there are no recovery actions to perform. Use the following FRU location codes to identify a D-link: U*-P1-R[1-8]-R[1-15] or U*-P1-R[1-8]-R28.
3. If there are 3 or more D-link failures associated with this drawer, the network connectivity for this drawer is compromised and spares must be deployed to replace all of the nodes in this drawer. Run the procedure in Availability Plus: Recovering a failure that impacts a drawer, on page 73.
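The counting and threshold logic in the steps above (which applies equally to the LR-link procedure that follows) can be sketched as follows. link_type and link_action are illustrative helper names, not shipped commands; the FRU location ranges are taken from the text, where a hub sub-location U*-P1-R[1-8]-R[N] identifies a D-link for N of 1-15 or 28, and an LR-link for N of 16-27 or 29-40:

```shell
# Classify the second-level resource number N from U*-P1-R[1-8]-R<N>.
link_type() {
    n=$1
    if [ "$n" -ge 1 ] && [ "$n" -le 15 ]; then
        echo "D-link"
    elif [ "$n" -eq 28 ]; then
        echo "D-link"
    elif [ "$n" -ge 16 ] && [ "$n" -le 27 ]; then
        echo "LR-link"
    elif [ "$n" -ge 29 ] && [ "$n" -le 40 ]; then
        echo "LR-link"
    else
        echo "unknown"
    fi
}

# Apply the 3-failure threshold from steps 2 and 3: fewer than 3 open link
# failures for a drawer needs no recovery action; 3 or more means the drawer's
# network connectivity is compromised.
link_action() {
    if [ "$1" -ge 3 ]; then
        echo "drawer recovery required"
    else
        echo "no recovery action"
    fi
}
```

The count fed to link_action would come from grepping the open serviceable events for the drawer's unit location, as shown in step 1a.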

4. This procedure ends here. Return to the procedure that pointed to this procedure.

Availability Plus: Recovering an LR-link

Perform this procedure if you have determined that an LR-link has failed. You should have been sent here by another procedure.
1. Determine how many LR-link failures are associated with this drawer.
a. Run the following command from the EMS and grep for the unit location of the failure that sent you to this procedure:
/opt/xcat/bin/xdsh hmc -l hscroot /usr/bin/lsrsrc-api -s IBM.ServiceEvent::\"Status=\'Open\'\"::HSCId::HSCName::Level::Nodes\" | grep [unit location]
2. If there are fewer than 3 LR-link failures associated with this drawer, there are no recovery actions to perform. Use the following FRU location codes to identify an LR-link: U*-P1-R[1-8]-R[16-27] or U*-P1-R[1-8]-R[29-40].
3. If there are 3 or more LR-link failures associated with this drawer, the network connectivity for this drawer is compromised and spares must be deployed to replace all of the nodes in this drawer. Run the procedure in Availability Plus: Recovering a failure that impacts a drawer, on page 73.
4. This procedure ends here. Return to the procedure that pointed to this procedure.

Availability Plus: Recovering a failure that impacts a drawer

This procedure moves resources away from a drawer with a failure that has compromised its function. You should have been sent here by another procedure.
1. Determine if there are any non-compute nodes in the compromised drawer. You should already have determined the unit location and physical location of the drawer before arriving at this step.
2. Non-compute node in drawer: If there are any non-compute nodes in the compromised drawer, perform the procedure in Availability Plus: Recovering a non-compute node, on page 65. When performing the procedure, use the option that indicates that you must go to a new drawer, because this problem impacts the entire drawer and there are no nodes available in the drawer to fail over to. When you have completed the referenced procedure, return here and proceed to the step labeled Compute nodes.

3. Compute nodes: If there are one or more Compute nodes in the compromised drawer, perform the following procedure, keeping in mind that you need to substitute a spare Compute node for each and every Compute node in the failed drawer: Availability Plus: Recovering a Compute Node, on page 64. If possible, substitute an entire spare drawer for the compromised drawer. Use the following table to help keep track. There are 7 rows, corresponding to the maximum of 7 nodes.

Table 11: Tracking compromised nodes in a drawer
Columns: Compromised Compute Node | Spare Compute Node | Spare Policy | Job Management Policy | Swap Complete

When finished, go to the end of this procedure.
4. This procedure ends here. Return to the procedure that pointed to this procedure.

Availability Plus: Recovering a PCIe Slot

If the PCIe slot is in a CEC with no LPAR using the PCIe slot, actions may be deferred until an LPAR will use the PCIe slot. If there is a spare slot in the CEC, perform the following:
1. Move the PCIe adapter in the broken slot (the one reported) to a spare slot in the CEC.
2. Reconfigure the nodes that were using the PCIe adapter to use it in the new slot by using the chvm command. For more information on doing this, see the xCAT document on Cluster Recovery.
3. ReIPL the CEC to re-partition it.
If there is no spare slot in the CEC, you must move all of the PCIe adapters to another designated backup CEC, as well as all of the node functions that require the PCIe adapters. Use the same procedure as in Availability Plus: Recovering a failure that impacts a drawer, on page 73, but perform the actions only on the non-compute nodes and use the option for moving them to another CEC drawer.

Return to the procedure that pointed to this procedure.

Gathering data for Availability Plus resource failures

To gather data for A+ resource failures, use the following procedure:
1. Log on to the EMS.
2. Run the xCAT command /opt/xcat/sbin/gatherfip.
3. Record the name of the output file that the command returns.
4. Send the output file to IBM.

Availability Plus: Restoring repaired non-compute nodes

Use this procedure after a previously failed non-compute node has been repaired.
1. Redefine the CEC with the non-compute node to the original definition.
2. Use mkhwconn to restore any CECs that were removed from the hardware server domain.
3. Use chdef node groups=[functional groups] to remove it from Aplus_defective.

Availability Plus: Restoring repaired Compute nodes

Use this procedure after a previously failed Compute node has been repaired.
1. Redefine the CEC with the Compute nodes to the original definition.
2. Use mkhwconn to restore any CECs that were removed from the hardware server domain.
3. Use chdef node groups=[functional groups] to remove it from Aplus_defective.

Hardware Service Locations for 9125-F2C

The following is not an exhaustive list of hardware service locations for the 9125-F2C server; it lists the most important locations for this document. For a complete list, see the hardware service guide for the 9125-F2C.

Hardware service locations are constructed using a hierarchy of location identifiers that indicate how components are contained within or populated on other components. The location always begins with the enclosure's unit location, which is composed of the enclosure's feature, an instance indication, and the enclosure's serial number. For the 9125-F2C, the format is like: U78A9.001.[serial number]. The 9125-F2C feature is always 78A9, and the instance is always 001. Because of the consistency of this format, a short-hand is often used for the unit location: U, or U*.

All of the important locations for this document are on the planar, P1. The location would look like

U78A9.001.[serial number]-P1. The following table gives the ranges of the important locations for this document and the components that comprise those ranges.

Table 12: Hardware Service Locations for 9125-F2C

- U*-P1-R1 through U*-P1-R8: HFI network hub chips.
- U*-P1-R9 through U*-P1-R40: Processor chips. You can determine the octant easily because the processor locations are mapped in order to the octants. Recall that there are 4 processor chips per octant. See Table 13 below for the mapping.
- U*-P1-R[1-8]-R1 through U*-P1-R[1-8]-R40: HFI network optical module locations on hubs.
- U*-P1-T1 through U*-P1-T8: HFI network D-link cable connector rows.
- U*-P1-T10 through U*-P1-T17: HFI network D-link cable connector rows.
- U*-P1-T*-T1 through U*-P1-T*-T8: HFI network D-link cable connector ports. There are 8 per row.
- U*-P1-T9: HFI network LR-link.

Note: HFI network locations are shown in more detail in HFI Network Locations, on page 52.

The location identifier P indicates a planar. The location identifier R indicates a resource; in the case of the 9125-F2C, these resources are A+ resources. The location identifier T indicates a connector location.

Table 13: Processor Hardware Service Location Range to Octant

- U*-P1-R9 through U*-P1-R12: Octant 1
- U*-P1-R13 through U*-P1-R16: Octant 2
- U*-P1-R17 through U*-P1-R20: Octant 3
- U*-P1-R21 through U*-P1-R24: Octant 4
- U*-P1-R25 through U*-P1-R28: Octant 5
- U*-P1-R29 through U*-P1-R32: Octant 6
- U*-P1-R33 through U*-P1-R36: Octant 7
- U*-P1-R37 through U*-P1-R40: Octant 8
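The Table 13 mapping is simple arithmetic: processors occupy R9 through R40, four per octant, so octant = (R - 9) / 4 + 1 using integer division. A minimal sketch (octant_of is a hypothetical helper; the unit prefix is abbreviated as U*, as in the text):

```shell
# Map a processor service location such as U*-P1-R17 (or the full form with a
# unit serial) to its octant. Only top-level processor resources R9..R40 are
# valid inputs; hub locations R1..R8 fall through to the error message.
octant_of() {
    loc=$1
    r=${loc##*-R}                  # strip through the last "-R" to get the number
    if [ "$r" -ge 9 ] && [ "$r" -le 40 ]; then
        echo $(( (r - 9) / 4 + 1 ))
    else
        echo "not a processor location"
    fi
}
```

For example, octant_of 'U*-P1-R21' yields 4, matching the Table 13 row for U*-P1-R21 through U*-P1-R24.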

Data Collection

When engaging IBM service and support, data collection can be a very important step in gathering the information required to isolate a problem. Typically, this is done under the direction of IBM. However, here are some typical first requests for data. The following table is organized by subsystem or problem, followed by the data collection procedure.

- ISNM: On the EMS, run /usr/bin/cnm.snap. The output file is /var/opt/isnm/cnm/log/[ems hostname] [timestamp].snap.tar.gz.
- HFI network or HFI driver, specific to a node: On the node, run /usr/hfi/bin/hfi.snap. The output file is a compressed tar file in /var/adm/hfi/snaps.
- TEAL: On the EMS, the main TEAL log is /var/log/teal/teal.log; other logs are in /var/log/teal/*. To dump the TEAL table: tltab --dump --path /home/joe
- xCAT: On the EMS, run /opt/xcat/sbin/xcatsnap.
- Availability-Plus data: On the EMS, run /opt/xcat/sbin/gatherfip.
- AIX: AIX snap.
- GPFS: See Gathering data to solve GPFS problems in the GPFS Problem Determination Guide. This guide is referenced in Cluster software and firmware information resources, on page 21.
- LoadLeveler: See Information to collect prior to contacting IBM service in the LoadLeveler Diagnosis and Messages guide. This guide is referenced in Cluster software and firmware information resources, on page 21.
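For scripting convenience, the per-subsystem snap commands in the table can be looked up with a small dispatch function. collect_cmd is an illustrative helper, not part of the product; it only echoes the commands documented above and where they run:

```shell
# Print the documented first-pass data-collection command for a subsystem
# keyword; anything without a single command falls through to the table.
collect_cmd() {
    case $1 in
        isnm)  echo "/usr/bin/cnm.snap" ;;          # run on the EMS
        hfi)   echo "/usr/hfi/bin/hfi.snap" ;;      # run on the affected node
        xcat)  echo "/opt/xcat/sbin/xcatsnap" ;;    # run on the EMS
        aplus) echo "/opt/xcat/sbin/gatherfip" ;;   # run on the EMS
        *)     echo "see the Data Collection table" ;;
    esac
}
```

A wrapper like this is only a convenience for remembering paths; the actual snaps must still be run on the host indicated in the table.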

HFI Network Event Reference

The following documents the HFI network events reported by ISNM to TEAL; it includes all such events. Recall that events may be turned into alerts; the AlertID is the same as the EventID for these ISNM events. Look up the event/alert using the EventID/AlertID column, based on the event_id reported in tllsevent, the alert_id reported in tllsalert, or the refcode/src reported in Service Focal Point.

Some events are not reported as alerts, because they are reported by another subsystem (such as the Power subsystem); these events are used purely to help analyze the root cause of network events for which ISNM is responsible for reporting. These events have a FRU list of NA.

There are some compound alerts that are generated using pattern analysis of the network events. Their alert IDs will never be seen in the TEAL event database, nor are they reported as events by ISNM. These have alert IDs that begin with BBFF.

The following defines the columns in the table:

EventID/AlertID - Code representing the event. Maps to the event_id in the TEAL event database, the alert_id in the TEAL alert database, or the refcode/src in SFP.

Name - A brief name describing the event/alert.

Alert Message - A descriptive message generated when an alert is created from this event or group of events. It includes location information represented as variables in the table, which are replaced with actual location information for a given instance of an alert. If a location variable is prefixed with nbr_, it refers to the neighbor. Possible locations are frame, cage, supernode, drawer (within supernode), hub, nodeport (D-link port), remoteport (LR-link port), localport (LL-link port). For more on locations, see HFI Network Locations, on page 52.

Recommendation - A brief description of what to do in response to the problem being reported. This is reported only with alerts. It should be used in conjunction with other procedures in HFI Network Alerts and Serviceable Events, on page 38; Start Here, on page 23; and Power 775 Availability Plus actions, on page 61.

FRU List - A list of FRUs and procedures passed to Service Focal Point. Generic FRU information is in the table, whereas SFP and the alert database have specific FRU information substituted. For details on how to read the FRU list in the table, see Generic HFI Network FRU lists, on page 111. This is reported only in alerts.
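When scripting against tllsalert output, the BBFF convention above can be checked mechanically. is_compound_alert is a hypothetical helper for illustration, not a shipped tool; it simply tests the ID prefix:

```shell
# Return success (0) if an alert ID names a compound alert produced by
# pattern analysis -- such IDs begin with BBFF and never appear in the TEAL
# event database -- and failure (1) for ordinary event-derived alert IDs.
is_compound_alert() {
    case $1 in
        BBFF*) return 0 ;;
        *)     return 1 ;;
    esac
}
```

A script could use this to skip the event-database lookup for compound alerts, since by definition they have no corresponding event records.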

Table 14: HFI Network Event/Alert Reference

EventID/AlertID: BD
Name: Invalid Minor Number
Alert Message: Problem with network management reporting in frame $frame cage $cage (supernode $supernode drawer $drawer)
Recommendation: Call your next level of support and inform them that there is a general network manager problem and give them the alert ID.
FRU List: ISO:HFI_LNM

EventID/AlertID: BD
Names: Global Counter invalid: this counter went invalid; Global Counter ID conflict; Global Counter Takeover: this counter has assumed mastership; Global Counter Stale: counter has been made stale by higher ID; Global Counter ID overflow
Alert Message: Problem with global counter in frame $frame cage $cage (supernode $supernode drawer $drawer)
Recommendation: If there is a location in the alert message, record it. Call your next level of support and inform them that there is a problem with a global counter and give them the alert ID.
FRU List: ISO:HFIGCTR

EventID/AlertID: BD
Name: Multicast Route Table Array Uncorrectable Error
Alert Message: Problem with multicast route table in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub

80 EventID AlertID Name Alert Message Recommendation FRU List BD Multicast Route Table Array Correctable Error Problem with multicast route table in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub BD BD BD20000A BD20000B BD20000C BD20000D Multicast Input Array Uncorrectable Error Multicast Output Array Uncorrectable Error Multicast Input Array Correctable Error Multicast Output Array Correctable Error Multicast Packet Timeout Error Multicast Hardware Internal Error Problem with multicast route table in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub Problem with multicast route table in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub Problem with multicast route table in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub Problem with multicast route table in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub Problem with multicast route table in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub Problem with multicast route table in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub

81 EventID AlertID Name Alert Message Recommendation FRU List Problem with HFI ramp in frame $frame cage HFI Ramp Route Request FIFO Uncorrectable Error; left side $drawer) hub $hub BD40000E BD40000F BD BD BD BD BD HFI Ramp Route Request FIFO Uncorrectable Error; right side HFI Ramp SRT2 Hub Route Table Uncorrectable Error HFI Ramp SRT1 Supernode Route Table Uncorrectable Error HFI Ramp SRT1 Indirect Route Validity Table UE; left side HFI Ramp SRT1 Indirect Route Validity Table UE; right side HFI Ramp SRT Array Correctable Error Problem with HFI ramp in frame $frame cage $drawer) hub $hub Problem with HFI ramp in frame $frame cage $drawer) hub $hub Problem with HFI ramp in frame $frame cage $drawer) hub $hub Problem with HFI ramp in frame $frame cage $drawer) hub $hub Problem with HFI ramp in frame $frame cage $drawer) hub $hub Problem with HFI ramp in frame $frame cage $drawer) hub $hub

82 EventID AlertID Name Alert Message Recommendation FRU List BD HFI Ramp SRT Hardware Internal Error Problem with HFI ramp in frame $frame cage $drawer) hub $hub BD BD BD BD BD40001A BD40001B HFI Down HFI Ramp Input Port Linked List Parity Error HFI Ramp Input Port Async IF Array Correctable Error HFI Ramp Input Port Array Correctable Error HFI Ramp Input Port Array Uncorrectable Error HFI Ramp Output Port Async IF Array Correctable Error HFI down in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub Problem with HFI ramp in frame $frame cage $drawer) hub $hub Problem with HFI ramp in frame $frame cage $drawer) hub $hub Problem with HFI ramp in frame $frame cage $drawer) hub $hub Problem with HFI ramp in frame $frame cage $drawer) hub $hub Problem with HFI ramp in frame $frame cage $drawer) hub $hub

83 EventID AlertID Name Alert Message Recommendation FRU List Problem with HFI ramp in frame $frame cage HFI Ramp Output Port Async IF Array Uncorrectable Error $drawer) hub $hub BD40001C BD40001D BD40001E BD40001F BD BD BD HFI Ramp Input Port Async IF Array Uncorrectable Error HFI Ramp Output Port Array Correctable Error HFI Ramp Output Port Array Uncorrectable Error HFI Ramp Output Port Sender Hang D Link Port Ready Problem with HFI ramp in frame $frame cage $drawer) hub $hub Problem with HFI ramp in frame $frame cage $drawer) hub $hub Problem with HFI ramp in frame $frame cage $drawer) hub $hub Problem with HFI ramp in frame $frame cage $drawer) hub $hub This event should not have been reported here. them the alert ID. D Link Port Lane Width Change between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $nodeport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port D Link Port Lane Width Change $neighbor_nodeport Call your next level of support and inform them that there is a general network problem and give them the alert ID. This event should not have generated an alert. Call your next level of service and indicate that there is a problem with event analysis and give There is a problem with a D-Link. Record the location in the alert message. Log on to the Management Server. To isolate to the proper FRU, run Link Diags and perform the actions that it recommends. If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed. ISO:HFI_NET NA ISO:HFI_DDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, FRU:nbr_torrent,FRU:local _torrent

84 EventID AlertID Name Alert Message Recommendation FRU List This event should not have generated an alert. Call your next level of service and indicate that there is a problem with event analysis and give BD LR Link Port Ready This event should not have been reported here. them the alert ID. NA There is a problem with an LR-Link. LR Link Port Lane Width Change between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port Record the location in the alert message. $remoteport and frame $neighbor_frame cage Log on to the Management Server. LR Link Port Lane Width $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_remoteport To isolate to the proper FRU, run Link Diags and perform the actions that it recommends. If no action is recommended, because Diags cannot isolate to the proper FRU, replace the BD Change FRUs in the order listed. BD BD BD D Link Port Down LR Link Port Down Llocal Link Port Down D-link down between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $nodeport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_nodeport LR-link down between frame $frame cage $drawer) hub $hub port $remoteport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_remoteport LL-link down in frame $frame cage $cage (supernode $supernode drawer $drawer) There is a problem with a D-Link. Record the location in the alert message. Log on to the Management Server. To isolate to the proper FRU, run Link Diags and perform the actions that it recommends. If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed. There is a problem with an LR-Link. Record the location in the alert message. Log on to the Management Server. 
To isolate to the proper FRU, run Link Diags and perform the actions that it recommends. If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed. ISO:HFI_LDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, SYM:HFI_OM:nbr_om1, SYM:HFI_OM:local_om1, FRU:nbr_torrent,FRU:local _torrent, SYM:HFI_OM:nbr_om2,SY M:HFI_OM:local_om2 ISO:HFI_DDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, FRU:nbr_torrent,FRU:local _torrent ISO:HFI_LDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, SYM:HFI_OM:nbr_om1, SYM:HFI_OM:local_om1, FRU:nbr_torrent,FRU:local _torrent, SYM:HFI_OM:nbr_om2,SY M:HFI_OM:local_om2

85 EventID AlertID Name Alert Message Recommendation FRU List BD Llocal Port Input Port Linked List Parity Error LL-link internal problem in frame $frame cage $drawer) hub $hub port $localport BD BD50002A BD50002B BD50002C BD50002D BD50002E Llocal Port Input Port Async IF Array Uncorrectable Error Llocal Port Input Port Async IF Array Correctable Error Llocal Port Input Port Array Uncorrectable Error Llocal Port Input Port Array Correctable Error Llocal Port Output Port Array Uncorrectable Error Llocal Port Output Port Array Correctable Error LL-link internal problem in frame $frame cage $drawer) hub $hub port $localport LL-link internal problem in frame $frame cage $drawer) hub $hub port $localport LL-link internal problem in frame $frame cage $drawer) hub $hub port $localport LL-link internal problem in frame $frame cage $drawer) hub $hub port $localport LL-link internal problem in frame $frame cage $drawer) hub $hub port $localport LL-link internal problem in frame $frame cage $drawer) hub $hub port $localport

86 EventID AlertID Name Alert Message Recommendation FRU List BD50002F Llocal Input Port Buffer Overflow LL-link problem in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $localport BD BD BD BD BD BD Llocal Input Port VC Hang Llocal Input Port Unexpected Flit Type Llocal Input Port VC Deadlock Case Llocal Port Output Port Credit Overflow Llocal Port Output Port Sender Hang Llocal Port Output Port Hard Failure LL-link hang in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $localport LL-link problem in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $localport LL-link deadlock in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $localport LL-link problem in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $localport LL-link hang in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $localport LL-link internal problem in frame $frame cage $drawer) hub $hub port $localport

87 EventID AlertID Name Alert Message Recommendation FRU List BD D Link Inbound Port CRC Threshold Exceeded D-link problem between frame $frame cage $drawer) hub $hub port $nodeport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_nodeport There is a problem with a D-Link. Record the location in the alert message. Log on to the Management Server. To isolate to the proper FRU, run Link Diags and perform the actions that it recommends. If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed. BD BD BD D Link Port Dropped Flit Threshold Exceeded D Link Port Total Replay Threshold Exceeded D Link Port Same Flit Retry Threshold Exceeded D-link problem between frame $frame cage $drawer) hub $hub port $nodeport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_nodeport D-link problem between frame $frame cage $drawer) hub $hub port $nodeport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_nodeport D-link problem between frame $frame cage $drawer) hub $hub port $nodeport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_nodeport There is a problem with a D-Link. Record the location in the alert message. Log on to the Management Server. To isolate to the proper FRU, run Link Diags and perform the actions that it recommends. If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed. There is a problem with a D-Link. Record the location in the alert message. Log on to the Management Server. To isolate to the proper FRU, run Link Diags and perform the actions that it recommends. 
If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed. There is a problem with a D-Link. Record the location in the alert message. Log on to the Management Server. To isolate to the proper FRU, run Link Diags and perform the actions that it recommends. If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed. ISO:HFI_DDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, FRU:nbr_torrent,FRU:local _torrent ISO:HFI_DDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, FRU:nbr_torrent,FRU:local _torrent ISO:HFI_DDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, FRU:nbr_torrent,FRU:local _torrent ISO:HFI_DDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, FRU:nbr_torrent,FRU:local _torrent

Each entry below lists the event/alert ID, Name, Alert Message, Recommendation, and FRU List.

BD70003A  D Link Port Link Up Threshold Exceeded
  Alert Message: D-link problem between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $nodeport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_nodeport
  Recommendation: There is a problem with a D-Link. Record the location in the alert message. Log on to the Management Server. To isolate to the proper FRU, run Link Diags and perform the actions that it recommends. If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed.
  FRU List: ISO:HFI_DDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, FRU:nbr_torrent, FRU:local_torrent

BD70003B  D Link Input Port Linked List Parity Error
  Alert Message: D-link internal problem in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $nodeport

BD70003C  D Link Port Credit Overflow
  Alert Message: D-link overflow problem between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $nodeport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_nodeport.
  Recommendation: There is a problem with a D-Link that is likely to be a local issue. Record the location in the alert message. Log on to the Management Server. To isolate to the proper FRU, run Link Diags and perform the actions that it recommends. If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed.
  FRU List: ISO:HFI_DDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, FRU:nbr_torrent

BD70003D  D Link Port Buffer Overflow
  Alert Message: D-link overflow problem between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $nodeport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_nodeport.
  Recommendation: Same as BD70003C.
  FRU List: ISO:HFI_DDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, FRU:nbr_torrent
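The $-prefixed fields in the alert messages above are placeholders that the alert subsystem fills in with the failing link's location when the alert is raised. As an illustrative sketch only (the location values below are invented for the example, not taken from a real system), the substitution can be modeled with Python's string.Template:

```python
from string import Template

# The alert-message template, as it appears in the table above.
msg = Template(
    "D-link problem between frame $frame cage $cage "
    "(supernode $supernode drawer $drawer) hub $hub port $nodeport"
)

# Hypothetical location values for one D-link endpoint.
location = {
    "frame": 3, "cage": 5, "supernode": 12,
    "drawer": 1, "hub": 7, "nodeport": 2,
}

# Fill in the placeholders to produce the message a user would see.
print(msg.substitute(location))
# -> D-link problem between frame 3 cage 5 (supernode 12 drawer 1) hub 7 port 2
```

When you record "the location in the alert message" as the recommendations direct, these substituted values are what you are recording.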

BD70003E  D Link Port VC Deadlock Case
  Alert Message: D-link deadlock between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $nodeport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_nodeport
  Recommendation: There is a problem with a D-Link. Record the location in the alert message. Log on to the Management Server. To isolate to the proper FRU, run Link Diags and perform the actions that it recommends. If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed.
  FRU List: ISO:HFI_DDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, FRU:nbr_torrent, FRU:local_torrent

BD70003F  D Link Port Unexpected Flit Type
  Alert Message: D-link problem between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $nodeport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_nodeport
  Recommendation and FRU List: Same as BD70003E.

BD  D Link Port Sender Hang
  Alert Message: D-link hang between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $nodeport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_nodeport
  Recommendation and FRU List: Same as BD70003E.

BD  D Link Port Input Port VC Hang Timeout
  Alert Message: D-link hang between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $nodeport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_nodeport
  Recommendation and FRU List: Same as BD70003E.

Each of the following events reports the alert message "D-link internal problem in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $nodeport"; no recommendation or FRU list is listed for them:

BD  D Link Input Port Async IF Array Uncorrectable Error
BD  D Link Input Port Async IF Array Correctable Error
BD  D Link Input Port Array Uncorrectable Error
BD  D Link Input Port Array Correctable Error
BD  D Link Output Port Array Uncorrectable Error
BD  D Link Output Port Array Correctable Error
BD  D Link Output Port Async IF Array Uncorrectable Error

Each of the following events reports the alert message "D-link internal problem in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $nodeport"; no recommendation or FRU list is listed for them:

BD  D Link Output Port Async IF Array Correctable Error
BD70004A  D Link Port Hard Failure
BD70004B  D Port PRT1 Array Uncorrectable Error
BD70004C  D Port PRT1 Array Correctable Error

BD70004D  LR Link Inbound Port CRC Threshold Exceeded
  Alert Message: LR-link problem between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $remoteport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_remoteport
  Recommendation: There is a problem with an LR-Link. Record the location in the alert message. Log on to the Management Server. To isolate to the proper FRU, run Link Diags and perform the actions that it recommends. If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed.
  FRU List: ISO:HFI_LDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, SYM:HFI_OM:nbr_om1, SYM:HFI_OM:local_om1, FRU:nbr_torrent, FRU:local_torrent, SYM:HFI_OM:nbr_om2, SYM:HFI_OM:local_om2

BD70004E  LR Link Port Dropped Flit Threshold Exceeded
  Alert Message: LR-link problem between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $remoteport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_remoteport
  Recommendation and FRU List: Same as BD70004D.
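Each FRU List cell above is an ordered, comma-separated sequence of entries tagged ISO: (an isolation procedure), SYM: (a symbolic procedure, optionally qualified by a port or optical-module argument), or FRU: (a field-replaceable unit); the order matters because parts are replaced in the order listed. As an illustrative sketch only (the parser and its output format are not part of the product), one cell can be split into typed entries like this:

```python
# Parse one FRU List cell into ordered (kind, name, argument) tuples.
# kind is "ISO", "SYM", or "FRU"; argument is the optional qualifier
# after a second colon (e.g. nbr_port), or None when absent.
def parse_fru_list(cell: str):
    entries = []
    for item in cell.split(","):
        item = item.strip()
        if not item:          # tolerate the doubled commas seen in the tables
            continue
        kind, _, rest = item.partition(":")
        name, _, arg = rest.partition(":")
        entries.append((kind, name, arg or None))
    return entries

cell = ("ISO:HFI_DDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, "
        "FRU:nbr_torrent, FRU:local_torrent")
for kind, name, arg in parse_fru_list(cell):
    print(kind, name, arg)
```

The ISO entry names the isolation procedure to run first (see "HFI Isolation Procedures"); the SYM and FRU entries that follow give the replacement order when isolation cannot narrow the fault further.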

BD70004F  LR Link Port Total Replay Threshold Exceeded
  Alert Message: LR-link problem between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $remoteport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_remoteport
  Recommendation: There is a problem with an LR-Link. Record the location in the alert message. Log on to the Management Server. To isolate to the proper FRU, run Link Diags and perform the actions that it recommends. If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed.
  FRU List: ISO:HFI_LDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, SYM:HFI_OM:nbr_om1, SYM:HFI_OM:local_om1, FRU:nbr_torrent, FRU:local_torrent, SYM:HFI_OM:nbr_om2, SYM:HFI_OM:local_om2

BD  LR Link Port Same Flit Retry Threshold Exceeded
  Alert Message: LR-link problem between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $remoteport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_remoteport
  Recommendation and FRU List: Same as BD70004F.

BD  LR Link Port Link Up Threshold Exceeded
  Alert Message: LR-link problem between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $remoteport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_remoteport
  Recommendation and FRU List: Same as BD70004F.

BD  LR Link Input Port Linked List Parity Error
  Alert Message: LR-link internal problem in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $remoteport

BD  LR Link Port Credit Overflow
  Alert Message: LR-link overflow problem between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $remoteport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_remoteport.
  Recommendation: There is an internal problem with an LR-Link port that is likely to be a local issue. Log on to the Management Server. Replace the FRUs in the order listed.
  FRU List: ISO:HFI_LDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, SYM:HFI_OM:local_om1, SYM:HFI_OM:nbr_om1, FRU:nbr_torrent, SYM:HFI_OM:local_om2, SYM:HFI_OM:nbr_om2

BD  LR Link Port Buffer Overflow
  Alert Message: LR-link overflow problem between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $remoteport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_remoteport.
  Recommendation: There is an internal problem with an LR-Link port that is likely to be a local issue. Log on to the Management Server. Replace the FRUs in the order listed.
  FRU List: ISO:HFI_LDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, SYM:HFI_OM:local_om1, SYM:HFI_OM:nbr_om1, FRU:nbr_torrent, SYM:HFI_OM:local_om2, SYM:HFI_OM:nbr_om2

BD  LR Link Port VC Deadlock Case
  Alert Message: LR-link deadlock between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $remoteport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_remoteport
  Recommendation: There is a problem with an LR-Link. Record the location in the alert message. Log on to the Management Server. To isolate to the proper FRU, run Link Diags and perform the actions that it recommends. If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed.
  FRU List: ISO:HFI_LDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, SYM:HFI_OM:nbr_om1, SYM:HFI_OM:local_om1, FRU:nbr_torrent, FRU:local_torrent, SYM:HFI_OM:nbr_om2, SYM:HFI_OM:local_om2

BD  LR Link Port Unexpected Flit Type
  Alert Message: LR-link problem between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $remoteport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_remoteport
  Recommendation and FRU List: Same as LR Link Port VC Deadlock Case above.

BD  LR Link Port Sender Hang
  Alert Message: LR-link hang between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $remoteport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_remoteport
  Recommendation and FRU List: Same as LR Link Port VC Deadlock Case above.

BD  LR Link Port Input Port VC Hang Timeout
  Alert Message: LR-link hang between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $remoteport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_remoteport
  Recommendation and FRU List: Same as LR Link Port VC Deadlock Case above.

Each of the following events reports the alert message "LR-link internal problem in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $remoteport"; no recommendation or FRU list is listed for them:

BD  LR Link Input Port Async IF Array Uncorrectable Error
BD70005A  LR Link Input Port Async IF Array Correctable Error
BD70005B  LR Link Input Port Array Uncorrectable Error
BD70005C  LR Link Input Port Array Correctable Error

Each of the following events reports the alert message "LR-link internal problem in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $remoteport"; no recommendation or FRU list is listed for them:

BD70005D  LR Link Output Port Array Uncorrectable Error
BD70005E  LR Link Output Port Array Correctable Error
BD70005F  LR Link Output Port Async IF Array Uncorrectable Error
BD  LR Link Output Port Async IF Array Correctable Error
BD  LR Link Port Hard Failure

BD  D Link Port Optical Module Interrupt
  Alert Message: D-link optical module interrupt between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $nodeport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_nodeport

BD  LR Link Port Optical Module Interrupt
  Alert Message: LR-link optical module interrupt between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $nodeport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_nodeport

BD7000A0  D Link up with 9 lanes
  Alert Message: D Link Port Lane Width Change between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $nodeport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_nodeport
  Recommendation: There is a problem with a D-Link. Record the location in the alert message. Log on to the Management Server. To isolate to the proper FRU, run Link Diags and perform the actions that it recommends. If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed.
  FRU List: ISO:HFI_DDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, FRU:nbr_torrent, FRU:local_torrent

BD7000A1  D Link up with 8 lanes
  Alert Message, Recommendation, and FRU List: Same as BD7000A0.

BD7000A8  LR Link up with 5 lanes
  Alert Message: LR Link Port Lane Width Change between frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub port $remoteport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_remoteport
  Recommendation: There is a problem with an LR-Link. Record the location in the alert message. Log on to the Management Server. To isolate to the proper FRU, run Link Diags and perform the actions that it recommends. If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed.
  FRU List: ISO:HFI_LDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, SYM:HFI_OM:nbr_om1, SYM:HFI_OM:local_om1, FRU:nbr_torrent, FRU:local_torrent, SYM:HFI_OM:nbr_om2, SYM:HFI_OM:local_om2

BD7000A9  LR Link up with 4 lanes
  Alert Message, Recommendation, and FRU List: Same as BD7000A8.

Each of the following TX optical module events reports the alert message "Optical Module event in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub optical module $opticalmodule", with the recommendation "There is a problem with an optical module." and the FRU List: FRU:local_om1, SYM:HFI_OM:local_om2.

BD  TX optical module Temp High
BD  TX optical module Temp Low
BD  TX optical module Vcc3.3 High
BD  TX optical module Vcc3.3 Low
BD  TX optical module Vcc2.5 High
BD  TX optical module Vcc2.5 Low

Each of the following optical module events reports the alert message "Optical Module event in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub optical module $opticalmodule", with the recommendation "There is a problem with an optical module." and the FRU List: FRU:local_om1, SYM:HFI_OM:local_om2.

BD88007A  RX optical module Temp High
BD88007B  RX optical module Temp Low
BD88007C  RX optical module Vcc3.3 High
BD88007D  RX optical module Vcc3.3 Low
BD88007E  RX optical module Vcc2.5 High
BD88007F  RX optical module Vcc2.5 Low
BD  TX Loss of Signal
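The paired High/Low alerts above correspond to window monitoring of each optical-module sensor: a reading outside an upper or lower limit raises the matching alert. As an illustrative sketch only (the threshold values below are invented for the example, not the product's calibrated limits), the classification looks like this:

```python
# Classify a sensor reading against a lower and an upper limit,
# returning an alert-style name such as "RX optical module Temp High",
# or None when the reading is in range.
def classify(sensor: str, value: float, low: float, high: float):
    if value > high:
        return f"{sensor} High"
    if value < low:
        return f"{sensor} Low"
    return None

# Hypothetical limits, for illustration only.
print(classify("RX optical module Temp", 71.0, 0.0, 70.0))
# -> RX optical module Temp High
print(classify("RX optical module Vcc3.3", 3.3, 3.1, 3.5))
# -> None (in range, no alert)
```

This is why each sensor contributes two alert IDs to the table: one for the upper limit and one for the lower.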

BD  TX Fault
  Alert Message: Optical Module event in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub optical module $opticalmodule
  Recommendation: There is a problem with an optical module.
  FRU List: FRU:local_om1, SYM:HFI_OM:local_om2

BD  TX Bias Current High
  Alert Message: Optical Module event caused a link event in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub optical module $opticalmodule
  Recommendation: There is a problem with an optical module that impacted the link.
  FRU List: SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, FRU:nbr_torrent

BD  TX Light Output Power Low
  Alert Message, Recommendation, and FRU List: Same as TX Bias Current High above.

BD  TX Bias Current Low
  Alert Message, Recommendation, and FRU List: Same as TX Fault above.

BD  TX Light Output Power High
  Alert Message, Recommendation, and FRU List: Same as TX Fault above.

BD  RX Loss of Signal
  Alert Message, Recommendation, and FRU List: Same as TX Bias Current High above.

BD  RX Light Input Power Low
  Alert Message, Recommendation, and FRU List: Same as TX Bias Current High above.

BD  RX Light Input Power High
  Alert Message: Optical Module event in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub optical module $opticalmodule
  Recommendation: There is a problem with an optical module.
  FRU List: FRU:local_om1, SYM:HFI_OM:local_om2

Each of the following events reports the alert message "This event should not have been reported here.", with the recommendation "This event should not have generated an alert. Call your next level of service and indicate that there is a problem with event analysis and give them the alert ID." FRU List: NA.

BD  Processor 2 CEC DCA 1
BD  Processor 2 CEC DCA 2
BD  Processor 3 CEC DCA 1
BD  Processor 3 CEC DCA 2
BD  Processor 4 CEC DCA 1
BD  Processor 4 CEC DCA 2
BD  Processor 5 CEC DCA 1

Each of the following events reports the alert message "This event should not have been reported here.", with the recommendation "This event should not have generated an alert. Call your next level of service and indicate that there is a problem with event analysis and give them the alert ID." FRU List: NA.

BD  Processor 5 CEC DCA 2
BD  Processor 6 CEC DCA 1
BD  Processor 6 CEC DCA 2
BD  Processor 7 CEC DCA 1
BD  Processor 7 CEC DCA 2
BD  Processor 8 CEC DCA 1
BD  Processor 8 CEC DCA 2
BD  Processor 9 CEC DCA 1

Each of the following events reports the alert message "This event should not have been reported here.", with the recommendation "This event should not have generated an alert. Call your next level of service and indicate that there is a problem with event analysis and give them the alert ID." FRU List: NA.

BD  Processor 9 CEC DCA 2
BD0204F0  Overvoltage event in Octant frame $frame cage $cage hub $hub.
BD0204FF  Overvoltage event in CEC frame $frame cage $cage.
BD  High Ambient Temperature - System will be powered off
BD02007A  MCM Fault Detected 1
BD02007B  MCM Fault Detected 2
BD02007C  MCM Fault Detected 3
BD02007D  MCM Fault Detected 4
BD02007E  MCM thermal sensor issues; CEC will not power on.

Each of the following events reports the alert message "This event should not have been reported here.", with the recommendation "This event should not have generated an alert. Call your next level of service and indicate that there is a problem with event analysis and give them the alert ID." FRU List: NA.

BD  No Airflow in all DCAs
BD0203F0  Overcurrent event in an Octant Frame $frame cage $cage hub $hub.
BD0203FF  Overcurrent event in a CEC Frame $frame cage $cage.
BD  CEC performance reduced.
BD  CEC performance reduced.
BD  CEC power dropped due to MCM over temperature.
BD0200F2  Over temperature severe warning; system will be powered off.
BD0200F3  System recovered from Over Temperature condition.
BD0200F0  System will be powered off within 10 minutes.

Each of the following events reports the alert message "This event should not have been reported here.", with the recommendation "This event should not have generated an alert. Call your next level of service and indicate that there is a problem with event analysis and give them the alert ID." FRU List: NA.

BD0200F1  System recovered from Over Temperature warning.
BD  UEPO Power Cycled.
BD022C51  DEFECTIVE DCCA_01
BD022C52  DEFECTIVE DCCA_02
BD  NO_AIR_FLOW_IN_ALL_DCA
BD  LOGIC_OverTemp_DCA_OFF
BD  Torrent Functional Power-On
BD  Torrent Functional Power-Off

BD0000FF  Cannot connect to primary or backup HMC
  Alert Message: Cannot connect to either the primary or the backup HMC for reporting serviceable events.
  Recommendation: There is a problem communicating with both the primary and backup HMC for logging network serviceable events. Contact IBM Service and report the alert ID. Check the network connections between the EMS and HMCs. If the network problem cannot be isolated and repaired, contact the next level of support.
  FRU List: ISO:HFINSFP

BD  No Neighbor Found
  Alert Message: ISNM cannot find a neighbor for an event that should have one.
  Recommendation: A neighbor was not found for an event that requires one. Contact IBM Service and report the alert ID. Determine the neighbor, and check for events reported by it, because event analysis would not have been able to relate events on the neighbor to this event; this can result in reporting events that are not serviceable.
  FRU List: ISO:HFINNBR

BD00FFF0  on_error - location
  Alert Message: ISNM has reported an invalid location to TEAL.
  Recommendation: Record the alert ID, and call your next level of support.

BD  CNM failed to gather VPD for a system
  Alert Message: CNM could not gather system VPD for Frame $frame cage $cage (supernode $supernode drawer $drawer)
  Recommendation: CNM could not gather VPD from a system.
  FRU List: ISO:HFI_NOVPD

BD  CNM failed to gather VPD for an enclosure
  Alert Message: CNM could not gather enclosure VPD for Frame $frame cage $cage (supernode $supernode drawer $drawer)
  Recommendation and FRU List: Same as above.

BD  CNM failed to gather VPD for a planar
  Alert Message: CNM could not gather planar VPD for Frame $frame cage $cage (supernode $supernode drawer $drawer)
  Recommendation and FRU List: Same as above.

BD  CNM failed to gather VPD for a hub chip
  Alert Message: CNM could not gather hub VPD for Frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub
  Recommendation and FRU List: Same as above.

BDFF0000  Faulty D Link Optical Module
  Alert Message: Faulty optical module in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub
  Recommendation: There is a problem with an optical module.
  FRU List: FRU:local_om1, SYM:HFI_OM:local_om2

BDFF0010  Faulty D Link Hub
  Alert Message: Faulty hub in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub

106 EventID AlertID Name Alert Message Recommendation FRU List BDFF0020 Faulty LR Link Optical Module Faulty optical module in frame $frame $cage (supernode $supernode drawer $drawer) hub $hub BDFF0030 Faulty LR Link Hub Faulty hub in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub BDFF0040 BDFF004A Faulty Optical Module Both ramps faulty in HFI hub Faulty optical module in frame $frame cage $drawer) hub $hub Both HFI ramps with problems frame $frame cage $drawer) hub $hub There is a problem with a hub chip. Replace the FRU according to FRU:common_torrent BDFF0050 BDFF0055 Faulty Network Hub Faulty LR-link cable assembly - 48 or more links Faulty hub in frame $frame cage $cage (supernode $supernode drawer $drawer) hub $hub Faulty LR link cable assembly \in frame $frame cage $drawer) affected 48 or more links. There is a problem with a hub chip. Replace the FRU according to There is a problem with a group of HFI ports associated with an LR Link Assembly. The problem may be with a particular handle in the assembly. Log on to the Management Server. The problem may be resolved by reseating or replacing the LR-link assembly. If there are operational LR-links remaining in the supernode, any service action can be very disruptive to the network. On the EMS, use the lsnwlinkinfo command to query the current state of the LR-links. FRU:common_torrent SYM:HFI_LRA:common_po rt

107 EventID AlertID Name Alert Message Recommendation FRU List There is a problem with a group of HFI ports associated with an LR Link Assembly. The problem may be resolved by reseating or replacing the LR-link assembly. If there are operational LR-links remaining in the supernode, any service action can be very Faulty or removed LR link cable assembly in disruptive to the network. frame $frame $cage (supernode $supernode On the EMS, use the lsnwlinkinfo command to drawer $drawer) afftecting 192 links. query the current state of the LR-links. BDFF0056 BDFF0057 BDFF0058 BDFF0060 Faulty LR-link cable assembly - full Faulty LR-link cable assembly - 64 or more links Faulty LR-link cable assembly or more links Suspicious Drawer Faulty or removed LR link cable assembly in frame $frame cage $cage (supernode $supernode drawer $drawer) affecting 64 or more links. Faulty or removed LR link cable assembly in frame $frame cage $cage (supernode $supernode drawer $drawer) affecting 128 or more links. Drawer level event occurred on frame $frame cage $drawer). There is a problem with a group of HFI ports associated with an LR Link Assembly. The problem may be resolved by reseating or replacing the LR-link assembly. If there are operational LR-links remaining in the supernode, any service action can be very disruptive to the network. On the EMS, use the lsnwlinkinfo command to query the current state of the LR-links. There is a problem with a group of HFI ports associated with an LR Link Assembly. The problem may be resolved by reseating or replacing the LR-link assembly. If there are operational LR-links remaining in the supernode, any service action can be very disruptive to the network. On the EMS, use the lsnwlinkinfo command to query the current state of the LR-links. SYM:HFI_LRA:common_po rt SYM:HFI_LRA:common_po rt A large number of HFI network links attached to a drawer are down without an accompanying power event. Contact IBM Service and report the alert ID. 
If a drawer lost power, then this is a secondary effect. ISO:HFI_IDR SYM:HFI_LRA:common_po rt

108 EventID AlertID Name Alert Message Recommendation FRU List A large number of HFI network links attached to a supernode are down without an Suspicious SuperNode SuperNode level event occurred on frame $frame supernode $supernode. Multiple drawers in a supernode are impacted. accompanying power event. Contact IBM Service and report the alert ID. If a supernode lost power, then this is a BDFF0061 secondary effect. BDFF0062 BDFF0063 Suspicious Frame Suspicious Cluster BDFF0070 Network Fault HFI Network Fault BDFF0080 BDFF0082 Bouncing D-Link Bouncing LR-Link Frame level event occurred in frame $frame. Multiple CECs are impacted. Cluster level event occurred. Multiple frames are impacted. D-link bouncing between frame $frame cage $drawer) hub $hub port $nodeport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_nodeport LR-link bouncing between frame $frame cage $drawer ) hub $hub port $remoteport and frame $neighbor_frame cage $neighbor_cage (supernode $neighbor_supernode drawer $neighbor_drawer) hub $neighbor_hub port $neighbor_remoteport A large number of HFI network links in a Frame are down without an accompanying power event. If a frame lost power, then this is a secondary effect. A large number of HFI network links in a Cluster are down without an accompanying power event. If multiple frames or drawers lost power, then this is a secondary effect. Call your next level of support and inform them that there is a general network problem and give them the alert ID. There is a problem with a D-Link. Record the location in the alert message. Log on to the Management Server. To isolate to the proper FRU, run Link Diags and perform the actions that it recommends. If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed. There is a problem with an LR-Link. Record the location in the alert message. Log on to the Management Server. 
To isolate to the proper FRU, run Link Diags and perform the actions that it recommends. If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed. ISO:HFI_ISN ISO:HFI_SFR ISO:HFI_ICL ISO:HFI_NET ISO:HFI_DDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, FRU:nbr_torrent,FRU:local _torrent ISO:HFI_LDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, SYM:HFI_OM:nbr_om1, SYM:HFI_OM:local_om1, FRU:nbr_torrent,FRU:local _torrent, SYM:HFI_OM:nbr_om2,SY M:HFI_OM:local_om2

109 EventID AlertID Name Alert Message Recommendation FRU List BDFF00A0 Random D-links in a Frame Frame $frame has a random set of D-links reporting events at about the same time. A random set of D-links have issues at the frame level. BDFF00A1 BDFF00A2 BDFF00A5 BDFF00A6 BDFF00A7 BDFF00AA BDFF00AB BDFF00AC Random LR-links in a Frame Random Optical Modules in a Frame Random D-links in a SuperNode Random LR-links in a SuperNode Random Optical Modules in a SuperNode Random D-links in a Drawer Random LR-links in a Drawer Random Optical Modules in a Drawer Frame $frame has a random set of LR-links reporting events at about the same time. Frame $frame has a random set of optical modules reporting events at about the same time. SuperNode $supernode has a random set of D- links reporting events at about the same time. Frame $frame supernode $supernode SuperNode $supernode has a random set of LR-links reporting events at about the same time. Frame $frame supernode $supernode SuperNode $supernode has a random set of optical modules reporting events at about the same time. Frame $frame supernode $supernode A drawer has a random set of D-links reporting events at about the same time. Frame $frame cage $drawer) A drawer has a random set of LR-links reporting events at about the same time. Frame $frame cage $drawer) A drawer has a random set of optical modules reporting events at about the same time. Frame $frame cage $cage (supernode $supernode drawer $drawer) A random set of LR-links have issues at the frame level. A random set of optical modules have issues at the frame level. A random set of D-links have issues at the supernode level. A random set of LR-links have issues at the supernode level. A random set of optical modules have issues at the supernode level. A random set of D-links have issues at the drawer level. Contact IBM Service and report the alert ID. A random set of LR-links have issues at the drawer level. 
ISO:HFI_RDL ISO:HFI_RLR ISO:HFI_ROM ISO:HFI_RDL ISO:HFI_RLR ISO:HFI_ROM ISO:HFI_RDL ISO:HFI_RLR A random set of optical modules have issues at the drawer level. ISO:HFI_ROM


Generic HFI Network FRU lists

When referencing HFI Network Alert FRU lists by EventID/AlertID, it is important to recognize that these are generic lists until the specific FRU information can be substituted for a given alert. They are comma separated and listed in the order in which they should be investigated as the root cause.

If the FRU list is NA, there should not be an alert generated for this EventID. If one is generated, this is a problem, as indicated in the Alert Message and Recommendation.

The generic FRU lists use a simple format to define a FRU:

[FRU type]:[FRU location or Procedure name]
or
[FRU type]:[Procedure name]:[FRU location]

The longer format ([FRU type]:[Procedure name]:[FRU location]) is only used for Symbolic FRUs (see the FRU types in the table below).

Table 15: FRU Types in Generic FRU List

ISO  Isolation procedure, which is used to help further diagnose root cause or otherwise recommend a course of action. Typically, HFI Network Alert FRU lists start with an isolation procedure. For a list of HFI Network isolation procedures, see HFI Isolation Procedures:, on page 43.
SYM  Symbolic procedure, which represents a physical FRU. This can be because there is no VPD for this FRU or there is no proper location for it. For a list of HFI Network symbolic procedures, see HFI Symbolic Procedures:, on page 50.
FRU  A physical FRU with proper VPD data and location information that can define it.

FRU locations are in the following table. They have prefixes of local, nbr, or common to indicate which end of a cable will be reported. For example, a neighbor hub chip is represented by nbr_torrent. The prefix common is used for compound alerts that combine multiple events into a single alert. It cannot be known ahead of time whether the local or the nbr end is important; therefore, the code searches for a location that is common to each event that comprises the compound alert. The table leaves off the prefix.

Table 16: FRU locations in Generic FRU List

_torrent  The HFI Hub chip.
_port     A port on the HFI Hub chip. It is typically used to indicate a cable location.
_om1      An optical module location.
_om2      A second optical module location. This is only used for LR-links, which are deployed two per optical module and thus have one location for each LR-link on an optical module.
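The search for a location shared by the events in a compound alert, described above for the common prefix, amounts to a set intersection. The sketch below is illustrative only; the event records and location strings are hypothetical examples, not actual TEAL data structures.

```python
# Hedged sketch of "common" prefix resolution: for a compound alert, find
# a FRU location that appears in every constituent event. The event
# records and location strings here are hypothetical.

def common_locations(events):
    """Return the locations shared by every event in a compound alert."""
    location_sets = [set(ev["locations"]) for ev in events]
    return sorted(set.intersection(*location_sets))

# Two hypothetical events combined into one alert; both implicate the
# same hub chip, so it would become the common_torrent location.
events = [
    {"event_id": "BDFF0000", "locations": ["torrent-7", "om1-3"]},
    {"event_id": "BDFF0010", "locations": ["torrent-7"]},
]
print(common_locations(events))  # ['torrent-7']
```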

Procedure names all begin with HFI and are seven characters long. For listings of the various procedure names, see the section appropriate for the FRU type. For isolation procedures, see HFI Isolation Procedures:, on page 43. For symbolic procedures, see HFI Symbolic Procedures:, on page 50.
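A generic FRU list entry can be split mechanically into its FRU type, procedure name, and location using the two formats defined above. The sketch below assumes only the two- and three-part entry shapes described in this section; the sample string is the Bouncing D-Link (BDFF0080) list from the alert table.

```python
# Illustrative parser for generic FRU list entries. Two-part entries are
# either ISO:procedure or FRU:location; the three-part form
# SYM:procedure:location is only used for Symbolic FRUs, as described above.

def parse_fru_list(fru_list):
    """Return (fru_type, procedure, location) tuples, in investigation order."""
    entries = []
    for item in fru_list.split(","):
        parts = item.strip().split(":")
        fru_type = parts[0]
        if len(parts) == 3:              # SYM:procedure:location
            procedure, location = parts[1], parts[2]
        elif fru_type == "FRU":          # physical FRU with a location
            procedure, location = None, parts[1]
        else:                            # isolation procedure only
            procedure, location = parts[1], None
        entries.append((fru_type, procedure, location))
    return entries

# The Bouncing D-Link (BDFF0080) generic FRU list from the table above:
sample = ("ISO:HFI_DDG, SYM:HFI_CAB:nbr_port, SYM:CBLCONT:local_port, "
          "FRU:nbr_torrent, FRU:local_torrent")
for entry in parse_fru_list(sample):
    print(entry)
```

The tuple order preserves the list order, which is also the order in which the FRUs should be investigated as the root cause.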

3 Notices

This information was developed for products and services offered in the U.S.A.

The manufacturer may not offer the products, services, or features discussed in this document in other countries. Consult the manufacturer's representative for information on the products and services currently available in your area. Any reference to the manufacturer's product, program, or service is not intended to state or imply that only that product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any intellectual property right of the manufacturer may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any product, program, or service.

The manufacturer may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to the manufacturer.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: THIS INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. The manufacturer may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to Web sites not owned by the manufacturer are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this product and use of those Web sites is at your own risk.

The manufacturer may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.

Information concerning products not produced by this manufacturer was obtained from the suppliers of those products, their published announcements or other publicly available sources. This manufacturer has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to products not produced by this manufacturer. Questions on the capabilities of products not produced by this manufacturer should be addressed to the suppliers of those products.

All statements regarding the manufacturer's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only.

The manufacturer's prices shown are the manufacturer's suggested retail prices, are current and are subject to change without notice. Dealer prices may vary.

This information is for planning purposes only. The information herein is subject to change before the products described become available.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

If you are viewing this information in softcopy, the photographs and color illustrations may not appear.

The drawings and specifications contained herein shall not be reproduced in whole or in part without the written permission of the manufacturer. The manufacturer has prepared this information for use with the specific machines indicated. The manufacturer makes no representations that it is suitable for any other purpose.

The manufacturer's computer systems contain mechanisms designed to reduce the possibility of undetected data corruption or loss. This risk, however, cannot be eliminated. Users who experience unplanned outages, system failures, power fluctuations or outages, or component failures must verify the accuracy of operations performed and data saved or transmitted by the system at or near the time of the outage or failure. In addition, users must establish procedures to ensure that there is independent data verification before relying on such data in sensitive or critical operations. Users should periodically check the manufacturer's support websites for updated information and fixes applicable to the system and related software.

Ethernet connection usage restriction

This product is not intended to be connected directly or indirectly by any means whatsoever to interfaces of public telecommunications networks.

Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at Copyright and trademark information at

Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries.

Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Red Hat, the Red Hat "Shadow Man" logo, and all Red Hat-based trademarks and logos are trademarks or registered trademarks of Red Hat, Inc., in the United States and other countries.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.

Electronic emission notices

Class A Notices

The following Class A statements apply to the IBM servers that contain the POWER7 processor and its features unless designated as electromagnetic compatibility (EMC) Class B in the feature information.

Federal Communications Commission (FCC) statement

Note: This equipment has been tested and found to comply with the limits for a Class A digital device, pursuant to Part 15 of the FCC Rules. These limits are designed to provide reasonable protection against harmful interference when the equipment is operated in a commercial environment. This equipment generates, uses, and can radiate radio frequency energy and, if not installed and used in accordance with the instruction manual, may cause harmful interference to radio communications. Operation of this equipment in a residential area is likely to cause harmful interference, in which case the user will be required to correct the interference at his own expense.

Properly shielded and grounded cables and connectors must be used in order to meet FCC emission limits. IBM is not responsible for any radio or television interference caused by using other than recommended cables and connectors or by unauthorized changes or modifications to this equipment. Unauthorized changes or modifications could void the user's authority to operate the equipment.

This device complies with Part 15 of the FCC rules. Operation is subject to the following two conditions: (1) this device may not cause harmful interference, and (2) this device must accept any interference received, including interference that may cause undesired operation.

Industry Canada Compliance Statement

This Class A digital apparatus complies with Canadian ICES-003.

Avis de conformité à la réglementation d'Industrie Canada

Cet appareil numérique de la classe A est conforme à la norme NMB-003 du Canada.

European Community Compliance Statement

This product is in conformity with the protection requirements of EU Council Directive 2004/108/EC on the approximation of the laws of the Member States relating to electromagnetic compatibility. IBM cannot accept responsibility for any failure to satisfy the protection requirements resulting from a non-recommended modification of the product, including the fitting of non-IBM option cards.

This product has been tested and found to comply with the limits for Class A Information Technology Equipment according to European Standard EN. The limits for Class A equipment were derived for commercial and industrial environments to provide reasonable protection against interference with licensed communication equipment.

European Community contact:
IBM Deutschland GmbH
Technical Regulations, Department M456
IBM-Allee 1, Ehningen, Germany
Tele:

Warning: This is a Class A product. In a domestic environment, this product may cause radio interference, in which case the user may be required to take adequate measures.

VCCI Statement - Japan

The following is a summary of the VCCI Japanese statement in the box above: This is a Class A product based on the standard of the VCCI Council. If this equipment is used in a domestic environment, radio interference may occur, in which case, the user may be required to take corrective actions.

Japanese Electronics and Information Technology Industries Association (JEITA) Confirmed Harmonics Guideline (products less than or equal to 20 A per phase)

Japanese Electronics and Information Technology Industries Association (JEITA) Confirmed Harmonics Guideline with Modifications (products greater than 20 A per phase)

Electromagnetic Interference (EMI) Statement - People's Republic of China

Declaration: This is a Class A product. In a domestic environment this product may cause radio interference in which case the user may need to perform practical action.

Electromagnetic Interference (EMI) Statement - Taiwan

The following is a summary of the EMI Taiwan statement above. Warning: This is a Class A product. In a domestic environment this product may cause radio interference in which case the user will be required to take adequate measures.

IBM Taiwan Contact Information:

Electromagnetic Interference (EMI) Statement - Korea

Please note that this equipment has obtained EMC registration for commercial use. In the event that it has been mistakenly sold or purchased, please exchange it for equipment certified for home use.

Germany Compliance Statement

Deutschsprachiger EU Hinweis: Hinweis für Geräte der Klasse A EU-Richtlinie zur Elektromagnetischen Verträglichkeit

Dieses Produkt entspricht den Schutzanforderungen der EU-Richtlinie 2004/108/EG zur Angleichung der Rechtsvorschriften über die elektromagnetische Verträglichkeit in den EU-Mitgliedsstaaten und hält die Grenzwerte der EN Klasse A ein. Um dieses sicherzustellen, sind die Geräte wie in den Handbüchern beschrieben zu installieren und zu betreiben. Des Weiteren dürfen auch nur von der IBM empfohlene Kabel angeschlossen werden. IBM übernimmt keine Verantwortung für die Einhaltung der Schutzanforderungen, wenn das Produkt ohne Zustimmung von IBM verändert bzw. wenn Erweiterungskomponenten von Fremdherstellern ohne Empfehlung von IBM gesteckt/eingebaut werden.

EN Klasse A Geräte müssen mit folgendem Warnhinweis versehen werden: "Warnung: Dieses ist eine Einrichtung der Klasse A. Diese Einrichtung kann im Wohnbereich Funk-Störungen verursachen; in diesem Fall kann vom Betreiber verlangt werden, angemessene Maßnahmen zu ergreifen und dafür aufzukommen."

Deutschland: Einhaltung des Gesetzes über die elektromagnetische Verträglichkeit von Geräten

Dieses Produkt entspricht dem Gesetz über die elektromagnetische Verträglichkeit von Geräten (EMVG). Dies ist die Umsetzung der EU-Richtlinie 2004/108/EG in der Bundesrepublik Deutschland.

Zulassungsbescheinigung laut dem Deutschen Gesetz über die elektromagnetische Verträglichkeit von Geräten (EMVG) (bzw. der EMC EG Richtlinie 2004/108/EG) für Geräte der Klasse A

Dieses Gerät ist berechtigt, in Übereinstimmung mit dem Deutschen EMVG das EG-Konformitätszeichen - CE - zu führen.

Verantwortlich für die Einhaltung der EMV Vorschriften ist der Hersteller:
International Business Machines Corp.
New Orchard Road
Armonk, New York
Tel:

Der verantwortliche Ansprechpartner des Herstellers in der EU ist:
IBM Deutschland GmbH
Technical Regulations, Abteilung M456
IBM-Allee 1, Ehningen, Germany
Tel:
tjahn@de.ibm.com

Generelle Informationen: Das Gerät erfüllt die Schutzanforderungen nach EN und EN Klasse A.

Electromagnetic Interference (EMI) Statement - Russia

Terms and conditions

Permissions for the use of these publications is granted subject to the following terms and conditions.

Personal Use: You may reproduce these publications for your personal, noncommercial use provided that all

More information

Managing the Advanced Systems Management Interface ESCALA REFERENCE 86 A1 39EV 03

Managing the Advanced Systems Management Interface ESCALA REFERENCE 86 A1 39EV 03 Managing the Advanced Systems Management Interface ESCALA REFERENCE 86 A1 39EV 03 ESCALA Managing the Advanced Systems Management Interface Hardware May 2009 BULL CEDOC 357 AVENUE PATTON B.P.20845 49008

More information

GB of cache memory per controller to DS4800 controllers with 8 GB of cache memory per controller.

GB of cache memory per controller to DS4800 controllers with 8 GB of cache memory per controller. IBM System Storage DS4800 Controller Cache Upgrade Kit Instructions Attention: IBM has renamed some FAStT family products. FAStT EXP100 has been renamed DS4000 EXP100, FAStT EXP700 has been renamed DS4000

More information

IBM. Setup Instructions. RS/ Model B50 SA

IBM. Setup Instructions. RS/ Model B50 SA RS/6000 7046 Model B50 IBM Setup Instructions SA38-0562-00 First Edition (September 1999) The following paragraph does not apply to the United Kingdom or any country where such provisions are inconsistent

More information

ESCALA. PCI Adapter. Placement REFERENCE 86 A1 37EV 05

ESCALA. PCI Adapter. Placement REFERENCE 86 A1 37EV 05 PCI Adapter ESCALA Placement REFERENCE 86 A1 37EV 05 ESCALA PCI Adapter Placement Hardware May 2009 BULL CEDOC 357 AVENUE PATTON B.P.20845 49008 ANGERS CEDEX 01 FRANCE REFERENCE 86 A1 37EV 05 The following

More information

Site and Hardware Planning for the E ESCALA Power7 REFERENCE 86 A1 00FF 08

Site and Hardware Planning for the E ESCALA Power7 REFERENCE 86 A1 00FF 08 Site and Hardware Planning for the E5-700 ESCALA Power7 REFERENCE 86 A1 00FF 08 ESCALA Models Reference The ESCALA Power7 publications concern the following models: Bull Escala E1-700 / E3-700 Bull Escala

More information

Power Systems TF Inch Flat Panel Rack-Mounted Monitor and Keyboard

Power Systems TF Inch Flat Panel Rack-Mounted Monitor and Keyboard Power Systems 7316-TF4 18.5-Inch Flat Panel Rack-Mounted Monitor and Keyboard Power Systems 7316-TF4 18.5-Inch Flat Panel Rack-Mounted Monitor and Keyboard Note Before using this information and the product

More information

HP UPS R/T3000 G2. Overview. Precautions. Kit contents. Installation Instructions

HP UPS R/T3000 G2. Overview. Precautions. Kit contents. Installation Instructions HP UPS R/T3000 G2 Installation Instructions Overview The HP UPS R/T3000 G2 features a 2U rack-mount with convertible tower design and offers power protection for loads up to a maximum of 3300 VA/3000 W

More information

IBM System Storage DS4700 Express Storage Subsystem Fibre Channel Cabling Guide

IBM System Storage DS4700 Express Storage Subsystem Fibre Channel Cabling Guide IBM System Storage DS700 Express Storage Subsystem Fibre Channel Cabling Guide Table summarizes the steps required to correctly install, cable, and power on your IBM System Storage DS700 Express Storage

More information

HP R/T2200 UPS. Overview. Precautions. Installation Instructions. The HP UPS R/T2200 features power protection for loads up to 2200 VA/1600 W.

HP R/T2200 UPS. Overview. Precautions. Installation Instructions. The HP UPS R/T2200 features power protection for loads up to 2200 VA/1600 W. HP R/T2200 UPS Installation Instructions Overview The HP UPS R/T2200 features power protection for loads up to 2200 VA/1600 W. For more information about any of the topics covered in this document, see

More information

HP UPS R/T3000 ERM. Overview. Precautions. Installation Instructions

HP UPS R/T3000 ERM. Overview. Precautions. Installation Instructions HP UPS R/T3000 ERM Installation Instructions Overview The ERM consists of two battery packs in a 2U chassis. The ERM connects directly to a UPS R/T3000 or to another ERM. Up to two ERM units can be connected.

More information

Bull D20 I/O Drawer. Installation Guide ORDER REFERENCE 86 A1 39EG 01

Bull D20 I/O Drawer. Installation Guide ORDER REFERENCE 86 A1 39EG 01 Bull D20 I/O Drawer Installation Guide ORDER REFERENCE 86 A1 39EG 01 Bull D20 I/O Drawer Installation Guide Hardware May 2003 BULL CEDOC 357 AVENUE PATTON B.P.20845 49008 ANGERS CEDEX 01 FRANCE ORDER

More information

System i and System p. Backplanes and cards

System i and System p. Backplanes and cards System i and System p Backplanes and cards System i and System p Backplanes and cards Note Before using this information and the product it supports, read the information in Notices on page 79 and the

More information

Cisco CRS 3-Phase AC Power Distribution Unit Installation Guide 2. Cisco CRS 3-Phase AC Power Distribution Unit 2

Cisco CRS 3-Phase AC Power Distribution Unit Installation Guide 2. Cisco CRS 3-Phase AC Power Distribution Unit 2 Cisco CRS 3-Phase AC Power Distribution Unit Installation Guide Cisco CRS 3-Phase AC Power Distribution Unit Installation Guide 2 Cisco CRS 3-Phase AC Power Distribution Unit 2 Revised: November 18, 2016,

More information

Installation Roadmap Guide

Installation Roadmap Guide IBM System Storage TS7650 ProtecTIER Deduplication Appliance Installation Roadmap Guide for ProtecTIER 3. GA3-090-0 IBM System Storage TS7650 ProtecTIER Deduplication Appliance Installation Roadmap Guide

More information

Obtaining Documentation and Submitting a Service Request, page xvii Safety Warnings, page xvii Safety Guidelines, page xx

Obtaining Documentation and Submitting a Service Request, page xvii Safety Warnings, page xvii Safety Guidelines, page xx Preface Obtaining Documentation and Submitting a Service Request, page xvii Safety s, page xvii Safety Guidelines, page xx Obtaining Documentation and Submitting a Service Request For information on obtaining

More information

ERserver. Maintenance Guide. Hardware Management Console for pseries SA

ERserver. Maintenance Guide. Hardware Management Console for pseries SA ERserver Hardware Management Console for pseries Maintenance Guide SA38-0603-01 ERserver Hardware Management Console for pseries Maintenance Guide SA38-0603-01 Second Edition (April 2002) Before using

More information

Completing Your System Installation

Completing Your System Installation Completing Your System Installation Quick Setup Instructions pseries 630-6C4 If you have not completed all of the steps in the Quick Setup Instructions labeled Read Me First, locate the Read Me First Quick

More information

Power Systems. Problem Determination and Service Guide for the IBM Power PS700 ( Y) IBM GI

Power Systems. Problem Determination and Service Guide for the IBM Power PS700 ( Y) IBM GI Power Systems Problem Determination and Service Guide for the IBM Power PS700 (8406-70Y) IBM GI11-9831-00 Power Systems Problem Determination and Service Guide for the IBM Power PS700 (8406-70Y) IBM GI11-9831-00

More information

Power Systems. Common service procedures

Power Systems. Common service procedures Power Systems Common service procedures Power Systems Common service procedures Note Before using this information and the product it supports, read the information in Notices, on page 177, the IBM Systems

More information

Power Systems. Service processor card assembly for the 9040-MR9 IBM

Power Systems. Service processor card assembly for the 9040-MR9 IBM Power Systems Serice processor card assembly for the 9040-MR9 IBM Power Systems Serice processor card assembly for the 9040-MR9 IBM Note Before using this information and the product it supports, read

More information

EOS-6000 Series Optical A/B Switch User Manual DC Version

EOS-6000 Series Optical A/B Switch User Manual DC Version EOS-6000 Series Optical A/B Switch User Manual DC Version For more information on this and other products: Contact Sales at EMCORE 626-293-3400, or visit www.emcore.com. Table of Contents Table of Contents...2

More information

IBM Storage Appliance 2421 Model AP1 Version 1 Release 1. Planning, Installation, and Maintenance Guide IBM SC

IBM Storage Appliance 2421 Model AP1 Version 1 Release 1. Planning, Installation, and Maintenance Guide IBM SC IBM Storage Appliance 2421 Model AP1 Version 1 Release 1 Planning, Installation, and Maintenance Guide IBM SC27-8520-02 Note Before using this information and the product it supports, read the information

More information

N3150, N3220, and N3240 Hardware and Service Guide

N3150, N3220, and N3240 Hardware and Service Guide IBM System Storage N3150, N3220, and N3240 Hardware and Service Guide CoveringtheN3150,N3220,andN3240models SC27-4214-04 Note: Before using this information and the product it supports, be sure to read

More information

IBM TotalStorage DS4500 Fibre Channel Storage Subsystem Cabling Guide

IBM TotalStorage DS4500 Fibre Channel Storage Subsystem Cabling Guide IBM TotalStorage DS4500 Fibre Channel Storage Subsystem Cabling Guide stallation overview Table 1 summarizes the steps required to correctly install, cable, and power on your IBM TotalStorage DS4500 Storage

More information

IBM Security SiteProtector System SP3001 Hardware Configuration Guide

IBM Security SiteProtector System SP3001 Hardware Configuration Guide IBM Security IBM Security SiteProtector System SP3001 Hardware Configuration Guide Version 2.9 Copyright statement Copyright IBM Corporation 1994, 2011. U.S. Government Users Restricted Rights Use, duplication

More information

Dell SC7020 Storage Controller Getting Started Guide

Dell SC7020 Storage Controller Getting Started Guide Dell SC7020 Storage Controller Getting Started Guide Regulatory Model: E03T Regulatory Type: E03T001 Notes, Cautions, and Warnings NOTE: A NOTE indicates important information that helps you make better

More information

IBM Storwize V5000 Gen2. Quick Installation Guide IBM GC

IBM Storwize V5000 Gen2. Quick Installation Guide IBM GC IBM Storwize V5000 Gen2 Quick Installation Guide IBM GC27-8581-02 Note Before using this information and the product it supports, read the following information: v The general information in Notices on

More information

Rack Installation Instructions

Rack Installation Instructions Rack Installation Instructions For System Storage EXP2512 and EXP2524 Express Storage Enclosures Use the instructions in this document to install an IBM System Storage EXP2512 Express Storage Enclosure

More information

Installation Quick Reference

Installation Quick Reference IBM System Storage TS3100 Tape Library and TS3200 Tape Library Installation Quick Reference Machine Type 3573 GA32-0548-00 IBM System Storage TS3100 Tape Library and TS3200 Tape Library Installation Quick

More information

Bull ESCALA PL 820R. Installation Guide ORDER REFERENCE 86 A1 19EG 01

Bull ESCALA PL 820R. Installation Guide ORDER REFERENCE 86 A1 19EG 01 Bull ESCALA PL 820R Installation Guide ORDER REFERENCE 86 A 9EG 0 Bull ESCALA PL 820R Installation Guide Hardware May 2003 BULL CEDOC 357 AVENUE PATTON B.P.20845 49008 ANGERS CEDEX 0 FRANCE ORDER REFERENCE

More information

N6000 Series Hardware and Service Guide

N6000 Series Hardware and Service Guide IBM System Storage N6000 Series Hardware and Service Guide Covering the N6040, N6060 and N6070 models GC53-1142-07 IBM System Storage N6000 Series Hardware and Service Guide Covering the N6040, N6060

More information

IBM. User's Guide and Installation and Service Guide. Supplement to the RS/6000 Enterprise Servers S70 and S7A:

IBM. User's Guide and Installation and Service Guide. Supplement to the RS/6000 Enterprise Servers S70 and S7A: Supplement to the RS/6000 Enterprise Servers S70 and S7A: IBM User's Guide and Installation and Service Guide User's Guide (SA38-0549-01), and Installation and Service Guide (SA38-0548-01) SN32-9068-00

More information

Service Guide for Hardware Management Consoles and Support Elements

Service Guide for Hardware Management Consoles and Support Elements System z Service Guide for Hardware Management Consoles and Support Elements GC28-6861-08 Level 08a System z Service Guide for Hardware Management Consoles and Support Elements GC28-6861-08 Level 08a

More information

350 East Plumeria Drive San Jose, CA USA May

350 East Plumeria Drive San Jose, CA USA May Installation Guide 350 East Plumeria Drive San Jose, CA 95134 USA May 2012 201-15135-02 NETGEAR, Inc. All rights reserved. No part of this publication may be reproduced, transmitted, transcribed, stored

More information

System i and System p. Power supply

System i and System p. Power supply System i and System p Power supply System i and System p Power supply Note Before using this information and the product it supports, read the information in Notices on page 131 and the IBM Systems Safety

More information

350 East Plumeria Drive San Jose, CA USA February

350 East Plumeria Drive San Jose, CA USA February Installation Guide 350 East Plumeria Drive San Jose, CA 95134 USA February 2013 201-15135-03 Support Thank you for selecting NETGEAR products. After installing your device, locate the serial number on

More information

IBM Storwize V7000. Troubleshooting, Recovery, and Maintenance Guide IBM GC

IBM Storwize V7000. Troubleshooting, Recovery, and Maintenance Guide IBM GC IBM Storwize V7000 Troubleshooting, Recovery, and Maintenance Guide IBM GC27-2291-11 Note Before using this information and the product it supports, read the following information: v The general information

More information

IntelliStation POWER 9112 Model 265. Installation Guide SA

IntelliStation POWER 9112 Model 265. Installation Guide SA IntelliStation POWER 9112 Model 265 Installation Guide SA38-0607-00 IntelliStation POWER 9112 Model 265 Installation Guide SA38-0607-00 First Edition (February 2002) Before using this information and

More information

Dell Storage Center. Getting Started Guide. SCv2000 and SCv2020 Storage System. Regulatory Model: E09J, E10J Regulatory Type: E09J001, E10J001

Dell Storage Center. Getting Started Guide. SCv2000 and SCv2020 Storage System. Regulatory Model: E09J, E10J Regulatory Type: E09J001, E10J001 Dell Storage Center SCv2000 and SCv2020 Storage System Getting Started Guide Regulatory Model: E09J, E10J Regulatory Type: E09J001, E10J001 Notes, Cautions, and Warnings NOTE: A NOTE indicates important

More information

Installing the ASA 5550

Installing the ASA 5550 CHAPTER 3 Installing the ASA 5550 Caution Read the safety warnings in the Regulatory Compliance and Safety Information for the Cisco ASA 5500 Series and follow proper safety procedures when performing

More information

System Storage EXP3000 Rack Installation Instructions

System Storage EXP3000 Rack Installation Instructions System Storage EXP3000 Rack Installation Instructions Review the documentation that comes with your rack cabinet for safety and cabling information. When you install the IBM System Storage EXP3000 in a

More information

Installing and Managing the Switch

Installing and Managing the Switch CHAPTER 2 This chapter describes how to install and manage the Cisco SFS 7008 system hardware and contains these sections: Safety, page 2-2 Preparing the Site, page 2-3 Rack-Mounting the Switch, page 2-4

More information

7316-TF3 17-Inch Flat Panel Rack-Mounted Monitor and Keyboard

7316-TF3 17-Inch Flat Panel Rack-Mounted Monitor and Keyboard Power Systems 7316-TF3 17-Inch Flat Panel Rack-Mounted Monitor and Keyboard SA38-0643-01 Power Systems 7316-TF3 17-Inch Flat Panel Rack-Mounted Monitor and Keyboard SA38-0643-01 Note Before using this

More information

IBM. Rack Installation Instructions

IBM. Rack Installation Instructions IBM Rack Installation Instructions Review the documentation that comes with your rack cabinet for safety and cabling information. When installing your server in a rack cabinet, consider the following:

More information

Enterprise Server S80 pseries 680 Model S85. Installation Guide SA

Enterprise Server S80 pseries 680 Model S85. Installation Guide SA Enterprise Server S80 pseries 680 Model S85 Installation Guide SA38-0582-00 First Edition (November 2000) Before using this information and the product it supports, read the information in Safety Notices

More information

E-Series Site Preparation Guide

E-Series Site Preparation Guide E-Series Site Preparation Guide September 2017 215-11797_A0 doccomments@netapp.com Table of Contents 3 Contents Deciding whether to use this guide... 10 Specifications of the model 3040 40U cabinet...

More information

Power Systems. Site preparation and physical planning IBM

Power Systems. Site preparation and physical planning IBM Power Systems Site preparation and physical planning IBM Power Systems Site preparation and physical planning IBM Note Before using this information and the product it supports, read the information in

More information

AEROTRAK PORTABLE AIRBORNE PARTICLE COUNTER MODEL 9310/9350/9510/9550/9500 QUICK START GUIDE

AEROTRAK PORTABLE AIRBORNE PARTICLE COUNTER MODEL 9310/9350/9510/9550/9500 QUICK START GUIDE AEROTRAK PORTABLE AIRBORNE PARTICLE COUNTER MODEL 9310/9350/9510/9550/9500 QUICK START GUIDE Thank you for purchasing a TSI AeroTrak Portable Airborne Particle Counter (particle counter). This guide will

More information

IBM. Setup Instructions. RS/ P Series SA

IBM. Setup Instructions. RS/ P Series SA RS/6000 7043 43P Series IBM Setup Instructions SA38-0510-02 Third Edition (October 1998) The following paragraph does not apply to the United Kingdom or any country where such provisions are inconsistent

More information

This 4200-RM Rack Mount Kit is for installation in 4200-CAB series cabinets only.

This 4200-RM Rack Mount Kit is for installation in 4200-CAB series cabinets only. Keithley Instruments, Inc. 28775 Aurora Road Cleveland, Ohio 44139 (440) 248-0400 Fax: (440) 248-6168 www.keithley.com Model 4200-RM Rack Mount Kit Packing List Introduction NOTE This 4200-RM Rack Mount

More information

IBM Storwize V3700. Quick Installation Guide IBM GC

IBM Storwize V3700. Quick Installation Guide IBM GC IBM Storwize V3700 Quick Installation Guide IBM GC27-4219-07 Note Before using this information and the product it supports, read the following information: v The general information in Notices on page

More information

System i and System p. Creating a virtual computing environment

System i and System p. Creating a virtual computing environment System i and System p Creating a virtual computing environment System i and System p Creating a virtual computing environment Note Before using this information and the product it supports, read the information

More information

Power775 PCI Express Card Hotplug Procedure Last Modified 02/06/2013

Power775 PCI Express Card Hotplug Procedure Last Modified 02/06/2013 Power775 PCI Express Card Hotplug Procedure Last Modified 02/06/2013 Page 1 of 15 CONTENTS 1 GENERAL...4 1.1 RELEASE / REVISION HISTORY...4 1.2 WHERE TO FIND THIS DOCUMENT, AND CONTENTS OF THE PARENT PDF...4

More information

Installing the Cisco MDS 9020 Fabric Switch

Installing the Cisco MDS 9020 Fabric Switch CHAPTER 2 This chapter describes how to install the Cisco MDS 9020 Fabric Switch and its components, and it includes the following information: Pre-Installation, page 2-2 Installing the Switch in a Cabinet

More information

System i and System p. Remote input/output (high-speed link) or GX Dual-port 4x HCA adapter

System i and System p. Remote input/output (high-speed link) or GX Dual-port 4x HCA adapter System i and System p Remote input/output (high-speed link) or GX Dual-port 4x HCA adapter System i and System p Remote input/output (high-speed link) or GX Dual-port 4x HCA adapter Note Before using

More information

N3150 Installation and Setup Instructions

N3150 Installation and Setup Instructions IBM System Storage N350 Installation and Setup Instructions Covering the N350 model GC27-426-0 Notices Mail comments to: IBM Corporation Attention Department GZW 9000 South Rita Road Tucson, AZ 85744-000

More information

3-Phase, Dual-Input 6-Slot Power Supply System STARTUP GUIDE

3-Phase, Dual-Input 6-Slot Power Supply System STARTUP GUIDE 3-Phase, Dual-Input 6-Slot Power Supply System STARTUP GUIDE -ST-01 Page 1 of 10 November 2016 2016 Copyright Lite-On Technology Corporation ALL RIGHTS RESERVED. Lite-On is a trademark of Lite-On Technology

More information

G5 PDU Installation Guide

G5 PDU Installation Guide G5 PDU Installation Guide 1 Contents Before You Begin... 3 Overview... 3 Important Safety Information... 3 Required Tools... 5 Section 1 Introduction... 6 Classification Overview... 6 Features... 7 Form

More information

Quick Installation Guide

Quick Installation Guide IBM Storwize V7000 Quick Installation Guide GC27-2290-04 Note Before using this information and the product it supports, read the general information in Notices on page 35, the information in the Safety

More information

LVN5200A-R2, rev. 1, Hardware Installation Guide

LVN5200A-R2, rev. 1, Hardware Installation Guide LVN5200A-R2 LVN5250A-R2 LVN5200A-R2, rev. 1, Hardware Installation Guide Customer Support Information Order toll-free in the U.S.: Call 877-877-BBOX (outside U.S. call 724-746-5500) FREE technical support

More information

Installation Manual. Mounting Instructions Mechanical Mounting. Luminato. Teleste Corporation

Installation Manual. Mounting Instructions Mechanical Mounting. Luminato. Teleste Corporation Luminato Installation Manual Teleste Corporation Mounting Instructions Mechanical Mounting Luminato Mechanical Installation, agile_59300316, rev0044 Introduction 1 Contents Introduction 4 General... 4

More information

Rack Installation Instructions

Rack Installation Instructions Rack Installation Instructions Review the documentation that comes with your rack cabinet for safety and cabling information. When installing your server in a rack cabinet, consider the following: v Two

More information

Dell SC5020 and SC5020F Storage Systems Getting Started Guide

Dell SC5020 and SC5020F Storage Systems Getting Started Guide Dell SC5020 and SC5020F Storage Systems Getting Started Guide Regulatory Model: E03T Regulatory Type: E03T001 Notes, Cautions, and Warnings NOTE: A NOTE indicates important information that helps you make

More information

IBM. Pre-Installation Configuration Workbook S/390. Parallel Enterprise Server - Generation 5 Parallel Enterprise Server - Generation 6 GC

IBM. Pre-Installation Configuration Workbook S/390. Parallel Enterprise Server - Generation 5 Parallel Enterprise Server - Generation 6 GC S/390 IBM Pre-Installation Configuration Workbook Parallel Enterprise Server - Generation 5 Parallel Enterprise Server - Generation 6 GC38-3120-02 S/390 IBM Pre-Installation Configuration Workbook GC38-3120-02

More information

SATA II HDD Canister KISS DA 435 Quick Reference Guide

SATA II HDD Canister KISS DA 435 Quick Reference Guide SATA II HDD Canister KISS DA 435 Quick Reference Guide If it s embedded, it s Kontron 1. Table of Contents SATA II HDD Canister KISS DA 435 1. Table of Contents 1. Table of Contents... 1 2. Important Information...

More information

3-4 SAS/SATA II HDD Canister Entry version USER S MANUAL XC-34D1-SA10-0-R. Document number: MAN A

3-4 SAS/SATA II HDD Canister Entry version USER S MANUAL XC-34D1-SA10-0-R. Document number: MAN A 3-4 SAS/SATA II HDD Canister Entry version XC-34D1-SA10-0-R USER S MANUAL Document number: MAN-00077-A ii Preface Important Information Warranty Our product is warranted against defects in materials and

More information

Upgrading and Servicing Guide

Upgrading and Servicing Guide Upgrading and Servicing Guide The only warranties for Hewlett-Packard products and services are set forth in the express statements accompanying such products and services. Nothing herein should be construed

More information

About Your Software Windows NT Workstation 4.0 Windows 98 Windows 95 Applications and Support Software

About Your Software Windows NT Workstation 4.0 Windows 98 Windows 95 Applications and Support Software IBM Personal Computer About Your Software Windows NT Workstation 4.0 Windows 98 Windows 95 Applications and Support Software IBM Personal Computer About Your Software Windows NT Workstation 4.0 Windows

More information

Dell SCv3000 and SCv3020 Storage System Getting Started Guide

Dell SCv3000 and SCv3020 Storage System Getting Started Guide Dell SCv3000 and SCv3020 Storage System Getting Started Guide Regulatory Model: E03T Regulatory Type: E03T001 Notes, Cautions, and Warnings NOTE: A NOTE indicates important information that helps you make

More information

Installation Job Aid for Ethernet Routing Switch 3600 Series

Installation Job Aid for Ethernet Routing Switch 3600 Series Installation Job Aid for Ethernet Routing Switch 3600 Series Notices NN47213-303 Issue 03.01 November 2017 Notice paragraphs alert you about issues that require your attention. Following are descriptions

More information

DS-1H05 Ethernet-over-Coax Extender. User Manual

DS-1H05 Ethernet-over-Coax Extender. User Manual DS-1H05 Ethernet-over-Coax Extender User Manual Thank you for purchasing our product. If there is any question or request, please do not hesitate to contact dealer. This manual is applicable to DS-1H05-T,

More information

InfoSphere Warehouse with Power Systems and EMC CLARiiON Storage: Reference Architecture Summary

InfoSphere Warehouse with Power Systems and EMC CLARiiON Storage: Reference Architecture Summary InfoSphere Warehouse with Power Systems and EMC CLARiiON Storage: Reference Architecture Summary v1.0 January 8, 2010 Introduction This guide describes the highlights of a data warehouse reference architecture

More information

IES User Manual. 8 FE + 1 MM SC Unmanaged Switch -40 to 75, DIN-rail. v

IES User Manual. 8 FE + 1 MM SC Unmanaged Switch -40 to 75, DIN-rail. v IES-0920 8 FE + 1 MM SC Unmanaged Switch -40 to 75, DIN-rail User Manual v1.00-1206 Preface This manual describes how to install and use the Industrial Ethernet Switch. This switch integrates full wire

More information

The power behind competitiveness. Delta Infrasuite Power Management. Power Distribution Unit. User Manual.

The power behind competitiveness. Delta Infrasuite Power Management. Power Distribution Unit. User Manual. The power behind competitiveness Delta Infrasuite Power Management Power Distribution Unit User Manual www.deltapowersolutions.com Save This Manual This manual contains important instructions and warnings

More information

N3240 Installation and Setup Instructions

N3240 Installation and Setup Instructions IBM System Storage N3240 Installation and Setup Instructions Covering the N3240 model GA32-2203-01 Notices Mail comments to: IBM Corporation Attention Department GZW 9000 South Rita Road Tucson, AZ 85744-0001

More information

Table of Contents Quick Install Guide page Introduction Safety Rack System Precautions ESD Precautions...

Table of Contents Quick Install Guide page Introduction Safety Rack System Precautions ESD Precautions... Table of Contents Quick Install Guide page 1 EN English Table of Contents 1. Introduction... 2 1.1 Safety... 2 1.2 Rack System Precautions... 2-3 1.3 ESD Precautions... 3... 3 1... 3 2 Fitting PSU s...

More information

Omnitron Systems Technology, Inc. 1. iconverter. 19-Module Managed Power Chassis User s Manual

Omnitron Systems Technology, Inc. 1. iconverter. 19-Module Managed Power Chassis User s Manual Omnitron Systems Technology, Inc. 1 iconverter 19-Module Managed Power Chassis User s Manual 27 Mauchly, #201, Irvine, CA 92618 Phone: (949) 250-6510; Fax: (949) 250-6514 2 Omnitron Systems Technology,

More information

IBM NetBAY Rack IBM. Planning Guide

IBM NetBAY Rack IBM. Planning Guide IBM NetBAY Rack IBM Planning Guide IBM NetBAY Rack IBM Planning Guide Note: Before using this information and the product it supports, be sure to read the general information in Appendix C, Notices on

More information

icore Kiosk system Installation Guide

icore Kiosk system Installation Guide icore Kiosk system Installation Guide The reproduction, transmission or use of this document or its contents is not permitted without express authority. Offenders will be liable for damages. All rights,

More information

4-Post and Universal Telco Frame (UTF) Rack Mount Kit Installation Instructions

4-Post and Universal Telco Frame (UTF) Rack Mount Kit Installation Instructions 4-Post and Universal Telco Frame (UTF) Rack Mount Kit Installation Instructions Review the documentation that comes with your rack cabinet for safety and cabling information. Before installing your server

More information