UCS C Series TAC Time Andreas Nikas Technical Leader
Agenda: Introduction, Working with TAC, Web Resources, New Hardware, New Software, Troubleshooting, Top Issues
Introduction: Ken Krzyzewski, Technical Leader, Data Center; Carlos Lopez, Technical Leader, Data Center; Jose Martinez, Technical Leader, Data Center; Andreas Nikas, Technical Leader, Data Center; Patrick Reardon, C-Series SME
Working with TAC
"A problem clearly stated is a problem half solved." (Charles F. Kettering, Head of Research, GM)
Information Gathering
What to collect for the Service Request:
Detailed problem description: When did it start happening? Is it reproducible?
Logs: tech-support, OS logs, sosreports, mcelogs, LSI MegaCLI logs
System info: software versions (CIMC, BIOS, OS); hardware (server model, adapter types and driver versions [enic/fnic])
Additional info: screenshots, topology, etc.
Analysis / Corrective Actions
What corrective actions have been taken, and what was the outcome?
Has the system been rebooted?
Has the system been upgraded?
Check system supportability using the UCS Hardware and Software Interoperability Tool.
Check the UCS C-Series Field Notices for possible matches.
Check the Release Notes for known open caveats.
Web Resources
Web Resources - Technical
UCS C Series Field Notices: http://www.cisco.com/en/us/products/ps10493/prod_field_notices_list.html
UCS C Series Release Notes: http://www.cisco.com/en/us/products/ps10739/prod_release_notes_list.html
Compatibility Matrix: http://www.cisco.com/en/us/products/ps10477/prod_technical_reference_list.html
Storage Compatibility Matrix: http://www.cisco.com/en/us/docs/switches/datacenter/mds9000/interoperability/matrix/Matrix8.html
Support Community: https://supportforums.cisco.com/index.jspa
Web Resources - Tools
Service Request: http://www.cisco.com/cisco/web/support/index.html#~shp_contact
Bug Search: https://tools.cisco.com/bugsearch/?referring_site=shp
Online Web Returns (POWR Tool): http://www.cisco.com/web/ordering/cs_info/or3/o32/return_a_product/webreturns/product_online_web_returns.html
RMA: http://www.cisco.com/en/us/docs/rma/3582.html
Cisco Notification Service: http://www.cisco.com/cisco/support/notifications.html
Partner Resources
Data Center Partner Portal: https://www.myciscocommunity.com/community/partner/datacenter
Partner Help Desk: (800)-GO-CISCO
New Hardware
UCS C3160 Dense Rack Server
Single server with dual CPU sockets
Up to 4GB RAID cache; enterprise storage features
Up to 256GB memory (8 DIMMs per socket)
Dual modular LOM (mLOM); multiple connectivity options
Up to 62 drive bays: 60 LFF plus 2 SFF
Optional bezel
UCS C3160 components
HDD: 4 rows of hot-swappable 4TB/6TB HDDs; total top load: 56 drives
Server node: 2x E5-2600 v2 CPUs, 128/256GB RAM, 1GB/4GB RAID cache
Fans: 8 hot-pluggable fans
Two 120GB SSDs for OS/boot
Optional disk expansion: 4x hot-swappable, rear-load LFF 4TB/6TB HDDs
System I/O Controller (SIOC) with Cisco mLOM slot
Power supply: 4 hot-pluggable PSUs
UCS C3160 configuration (what / specifications / quantity required)
Base chassis: UCS C3160 base chassis, 2x 120GB SSDs, 4 PSUs, 1 rail kit. Quantity required: min of 1 chassis.
Server node: 4 workload-specific configured nodes available: 2x E5-2620 v2/128GB/1GB RAID cache; 2x E5-2620 v2/256GB/4GB RAID cache; 2x E5-2660 v2/256GB/4GB RAID cache; 2x E5-2695 v2/256GB/4GB RAID cache. Quantity required: must choose 1 server node.
Drives: main drive bay filled by rows of 14: 14x, 28x, 42x or 56x 4TB 7200RPM LFF. Capacity: 224TB at FCS, 336TB post-FCS. Quantity required: must choose at least 1 row of disks. Note: 6TB rows will be available post-FCS.
SIOC: System I/O Controller with Cisco mLOM; NIC options: Cisco VIC 1227 10GbE SFP+ (dual) or Intel i350 mLOM NIC (quad 1GbE). Quantity required: min 1/max 2 SIOCs, and 1 NIC per SIOC.
Optional disk expansion node: tray + 4x 4TB 7200RPM LFF at FCS (16TB), 4x 6TB post-FCS (24TB). Quantity required: optional.
Physical dimensions: 4U height, 31.8 inch depth.
Stand-alone management: Cisco Integrated Management Controller (CIMC).
Note: All LFF drives on the UCS C3160 must be of the same capacity/type.
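As a quick check on the capacity figures in the C3160 configuration above, here is a minimal sketch of the raw-capacity arithmetic. It assumes decimal TB, no RAID overhead, and the single-capacity rule from the note; the names `raw_capacity_tb`, `ROW_SIZE` and so on are illustrative, not part of any Cisco tool.

```python
# Sketch: raw LFF capacity for the UCS C3160 drive options.
# Assumptions: decimal TB, no RAID overhead, all LFF drives the same size.

ROW_SIZE = 14         # drives per row in the top-load main bay
MAX_ROWS = 4          # 4 rows x 14 = 56 top-load drives
EXPANSION_DRIVES = 4  # optional rear-load disk expansion tray


def raw_capacity_tb(rows: int, drive_tb: int, expansion: bool = False) -> int:
    """Raw capacity in TB for `rows` populated rows plus the optional tray."""
    if not 1 <= rows <= MAX_ROWS:
        raise ValueError("choose between 1 and 4 rows of disks")
    drives = rows * ROW_SIZE + (EXPANSION_DRIVES if expansion else 0)
    return drives * drive_tb


if __name__ == "__main__":
    for size in (4, 6):
        print(f"4 rows of {size}TB drives: {raw_capacity_tb(4, size)} TB")
```

Four full rows give 56 x 4TB = 224TB at FCS and 56 x 6TB = 336TB post-FCS; the optional tray adds 4 x 4TB = 16TB (or 24TB with 6TB drives).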
Cisco IMC 2.0 Introduction
Firmware release naming: CIMC 1.x releases become Cisco IMC 2.x releases (in adherence with Cisco branding rules). This release will be branded Cisco IMC 2.0.
Supported platforms: UCS C22 M3 and C24 M3; C220 M3 and C240 M3; C-Series M4 platforms (future); C3160
Cisco IMC 2.0 New Themes & Features
Storage configuration: local storage management using the XML API; advanced RAID configuration options via the Web UI, CLI and XML API
Networking: Cisco IMC IPv6 support; Dynamic DNS; issue ping from the Web UI
Monitoring: SNMP Phase 4 (plus storage changes); syslog enhancements (port); DIMM blacklisting Phase 2; fault engine history; Platform Event Traps (PET) removed, PEF updated
Cisco IMC 2.0 New Themes & Features (continued)
Security: BIOS signing (signed update checking); Secure CIMC support (signed update checking); UEFI Secure Boot (Windows Server 2012); PSIRT fixes; password hashing using standard algorithms (SHA-512)
General: precision boot order control (CLI, Web UI and XML API support); import/export enhancements; display of PCIe adapter firmware versions; standardized BIOS tokens; local user enhancements; KVM enhancements (power controls, last boot capture, digital video recorder capability, chat capability); SCU: ESXi OS install and SAN-based ESXi install
RAID Controller: CSCuh86924 ESXi PSOD PF exception 14, LSI RAID Controller 9266-8i
The problem is due to a marginal voltage level on an internal voltage rail. It primarily impacts ESXi 4, ESXi 5 and Red Hat 6.4.
The MegaCLI lsi-fwterm.log will contain one or more of the following messages:
"Pmu Msg Fault!!! faultcode 00002651"
"Pmu Msg Fault!!! faultcode 00002656"
"Pmu Msg Fault!!! faultcode 0000620B"
"Controller encountered a fatal error and was reset" (multiple instances)
Manufacturing / New RMA
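To triage a collected lsi-fwterm.log against the CSCuh86924 signatures above, a small scanner can count each fault code and the fatal-reset message. This is a minimal sketch: the match strings come from the slide, but the "more than one reset" threshold is an assumption standing in for "multiple instances", and the function names are hypothetical.

```python
import re

# Signatures listed for CSCuh86924 in lsi-fwterm.log (taken from the slide).
FAULT_CODES = ("00002651", "00002656", "0000620B")
FATAL_RESET = "Controller encountered a fatal error and was reset"
FAULT_RE = re.compile(r"Pmu Msg Fault!!! faultcode ([0-9A-Fa-f]{8})")


def scan_fwterm(text: str) -> dict:
    """Count the CSCuh86924 signatures found in lsi-fwterm.log text."""
    counts = {code: 0 for code in FAULT_CODES}
    for match in FAULT_RE.finditer(text):
        code = match.group(1).upper()
        if code in counts:
            counts[code] += 1
    return {"fault_codes": counts, "fatal_resets": text.count(FATAL_RESET)}


def matches_cscuh86924(result: dict) -> bool:
    # Heuristic (assumption): any listed fault code, or repeated fatal resets.
    return any(result["fault_codes"].values()) or result["fatal_resets"] > 1
```

Typical use: `matches_cscuh86924(scan_fwterm(open("lsi-fwterm.log", errors="replace").read()))`. A match is a pointer toward the defect, not a diagnosis; confirm with TAC before pursuing the RMA.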
HDD: CSCul25263 Seagate 146/300/500GB/1TB/2TB HDDs may stop responding
Look for messages in the SEL logs and/or the OBFL logs. You may see a combination of the following symptoms: on the RAID controller, a physical or virtual drive shows as degraded or failed.
If you see any of the messages listed in the example below, you may be hitting this issue:
BMC:storage:-: SLOT-5: VD 01/1 is now DEGRADED
"HDD_07_STATUS: Drive Slot sensor, Drive Fault was asserted"
To verify this issue, power cycle the HDD or the server. If the issue clears, you have hit this issue; the power cycle is only a temporary workaround.
Next, in the collected tech support, find the file /mnt/jffs2/storage-data.
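When the SEL or OBFL logs are large, a line filter for the two message shapes above saves scrolling. A minimal sketch: the patterns are copied from the example messages, while the "FAILED" wording for the failed state is an assumption (only "DEGRADED" appears on the slide), and `find_symptoms` is a hypothetical name.

```python
import re

# Message shapes copied from the SEL/OBFL examples. "FAILED" is an
# assumed wording for the failed state; only "DEGRADED" appears on the slide.
VD_STATE_RE = re.compile(r"(SLOT-\d+): VD (\S+) is now (DEGRADED|FAILED)")
DRIVE_FAULT_RE = re.compile(r"(HDD_\d+)_STATUS: Drive Slot sensor, Drive Fault was asserted")


def find_symptoms(log_text: str) -> list:
    """Return (kind, detail) tuples for lines matching the CSCul25263 symptoms."""
    hits = []
    for line in log_text.splitlines():
        vd = VD_STATE_RE.search(line)
        if vd:
            hits.append(("virtual-drive", f"{vd.group(1)} VD {vd.group(2)} {vd.group(3)}"))
            continue
        fault = DRIVE_FAULT_RE.search(line)
        if fault:
            hits.append(("drive-fault", fault.group(1)))
    return hits
```

Hits only suggest the issue; the power-cycle test and the storage-data firmware check are still what confirm it.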
HDD (continued): CSCul25263 Seagate 146/300/500GB/1TB/2TB HDDs may stop responding
The following is an excerpt from the storage-data file; the key information is the product revision level, which shows that the drive firmware needs to be upgraded:
%controller "SLOT-4"
%type "RAID"
%physical-drive "4"
%group inquiry-data
+has-error: No
+vendor: SEAGATE
+product-id:st9300653ss
+product-revision-level: 0002 <<======= Need version 4 or later
The following Seagate hard drives are affected:
Seagate 146GB SAS 15K Hard Drive - ST9146853SS - Firmware version 0002
Seagate 300GB SAS 15K Hard Drive - ST9300653SS - Firmware version 0002
Seagate 500GB SATA 7.2K Hard Drive - ST500NM0011 - Firmware version 0001
Seagate 1TB SAS 7.2K Hard Drive - ST1000NM0001 - Firmware version 0001
Seagate 2TB SAS 7.2K Hard Drive - ST2000NM0001 - Firmware version 0001
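To check every drive in a tech support bundle against the affected list, the storage-data excerpt can be parsed into per-drive records. This is a sketch that assumes the file layout matches the excerpt (`%controller`, `%physical-drive`, then `+key: value` lines); the real file has more groups, and this simple parser merges all `+` lines under the preceding drive. It flags only the exact model/firmware pairs listed, and the function names are hypothetical.

```python
# Affected models and firmware revisions, as listed for CSCul25263.
AFFECTED = {
    "ST9146853SS": "0002",
    "ST9300653SS": "0002",
    "ST500NM0011": "0001",
    "ST1000NM0001": "0001",
    "ST2000NM0001": "0001",
}


def _value(line: str) -> str:
    # '%controller "SLOT-4"' -> 'SLOT-4'
    return line.partition(" ")[2].strip().strip('"')


def parse_drives(text: str) -> list:
    """Group '+key: value' lines under the preceding %physical-drive entry."""
    drives, controller, current = [], None, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("%controller"):
            controller = _value(line)
        elif line.startswith("%physical-drive"):
            current = {"controller": controller, "slot": _value(line)}
            drives.append(current)
        elif line.startswith("+") and ":" in line and current is not None:
            key, _, value = line[1:].partition(":")
            current[key.strip()] = value.strip()
    return drives


def affected_drives(drives: list) -> list:
    """Drives whose model and firmware revision match the affected list."""
    return [
        d for d in drives
        if d.get("vendor", "").upper() == "SEAGATE"
        and AFFECTED.get(d.get("product-id", "").upper()) == d.get("product-revision-level")
    ]
```

Note that the excerpt reports the product ID in lowercase, so the lookup normalizes it to uppercase before comparing.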
Troubleshooting
What if you still have HDD or RAID controller issues?
For HDD issues:
Try reversing the LSI cable ends and check connectivity (the cables are polarity sensitive).
Check /mnt/jffs2/fw_update; if one or more components are not up to date, use update_all.
Check /mnt/jffs2/storage-data for media errors and failures.
For RAID controller issues:
Check the controller firmware and BIOS. If they are not up to date, check whether HUU detects the controller and try to upgrade.
Collect MegaCLI logs. In the latest code, the ttylogs can be downloaded from the same place as the show tech. In 2.0, some of these logs are included in the show tech itself.
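For collecting MegaCLI logs from a host OS, a small wrapper can run the usual queries and save each one to a file. This is a sketch under stated assumptions: the binary names searched for and the output filenames (including writing the firmware terminal log to `lsi-fwterm.log`) are choices made for illustration, and MegaCLI typically needs to be run as root. Verify the switches against the MegaCLI version installed.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed output filenames; the switches are common MegaCLI options,
# but verify them against the CLI version installed on the host.
MEGACLI_QUERIES = [
    ("adapter-info.txt", ["-AdpAllInfo", "-aALL"]),
    ("physical-drives.txt", ["-PDList", "-aALL"]),
    ("virtual-drives.txt", ["-LDInfo", "-Lall", "-aALL"]),
    ("lsi-fwterm.log", ["-FwTermLog", "-Dsply", "-aALL"]),
]


def find_megacli():
    """Return the first MegaCLI binary found on PATH, or None."""
    for name in ("MegaCli64", "MegaCli", "megacli"):
        path = shutil.which(name)
        if path:
            return path
    return None


def build_commands(binary: str) -> list:
    """Pair each output filename with the full command line to run."""
    return [(name, [binary, *args]) for name, args in MEGACLI_QUERIES]


def collect(binary: str, outdir: str = "megacli-logs") -> None:
    """Run each query and save stdout+stderr under outdir (typically as root)."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    for filename, cmd in build_commands(binary):
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        (out / filename).write_text(result.stdout + result.stderr)
```

Usage: `binary = find_megacli()`, then `collect(binary)` if a binary was found; attach the resulting directory to the Service Request.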
Thank you.
Backup
Cisco IMC 2.0 Features: DIMM Blacklisting
Gets ECC counts from the IPMI ECC sensor.
Makes blacklisting decisions based on the IPMI sensor reading.
Blacklists a DIMM when an uncorrectable ECC error is encountered.
Maps out the blacklisted DIMM on the subsequent reboot of the host by communicating the decision to the BIOS. Map-out means the DIMM is excluded from the host memory configuration.
Maintains ECC counts across reboots in the blacklisting database.
Stores the blacklisting database on the DIMM SPD.
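The blacklisting flow above (accumulate ECC counts, blacklist on an uncorrectable error, map the DIMM out at the next boot) can be modeled in a few lines. This is a toy model, not the CIMC implementation: it assumes only uncorrectable errors trigger blacklisting, because the slide does not describe correctable-error thresholds. DIMM names such as `DIMM_A1` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BlacklistDB:
    """Toy model of the blacklisting database (CIMC keeps it on the DIMM SPD)."""
    ecc_counts: dict = field(default_factory=dict)  # DIMM -> cumulative ECC count
    blacklisted: set = field(default_factory=set)   # DIMMs to map out

    def record_reading(self, dimm: str, ecc_count: int, uncorrectable: bool) -> None:
        # Counts accumulate because the database survives host reboots.
        self.ecc_counts[dimm] = self.ecc_counts.get(dimm, 0) + ecc_count
        if uncorrectable:
            self.blacklisted.add(dimm)

    def memory_config_for_next_boot(self, installed: list) -> list:
        """DIMMs the BIOS would include after map-out on the next host boot."""
        return [d for d in installed if d not in self.blacklisted]
```

In this model the blacklist takes effect only when `memory_config_for_next_boot` is consulted, which mirrors the slide's point that the map-out happens on the subsequent reboot rather than immediately.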
Cisco IMC 2.0 Features (continued): HUU Low-Level Firmware Update
Until now, low-level components such as the FPGA and CPLD could only be updated via the CIMC CLI, and the update had to be done manually. With the Eagle Peak release, low-level firmware will be updated along with the HUU update.
Update All: no questions are asked; low-level firmware that requires an update is updated automatically when CIMC is activated after the HUU update.
Update CIMC: if CIMC is selected together with any other component in HUU, the user is prompted for the low-level firmware update. If the user selects yes, the components are updated on CIMC activation.
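The two HUU paths above reduce to a small decision function. This sketch encodes only the cases the slide describes; any other selection returns "not described" rather than guessing HUU's behavior, and the function name and return strings are hypothetical.

```python
def low_level_fw_action(update_all: bool, selected: set, needs_update: bool,
                        user_accepts: bool = False) -> str:
    """Model of when HUU updates low-level firmware (FPGA, CPLD, ...)."""
    if not needs_update:
        return "no low-level update"
    if update_all:
        # Update All: no prompt; applied when CIMC is activated.
        return "update on CIMC activation (no prompt)"
    if "CIMC" in selected and len(selected) > 1:
        # CIMC plus another component: the user is prompted.
        return "update on CIMC activation" if user_accepts else "skipped by user"
    # Assumption: other selections are not covered by the slide.
    return "not described"
```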
Cisco IMC 2.0 Features (continued): vKVM/vMedia Enhancements
KVM and vMedia reconnect; status bar; host power control; DVR and video player; exporting recorded video; server-side (BMC) video capture; video scaling; mini-mode; chat
Troubleshooting Backup
Replacing a Failed Disk
Do NOT replace a disk while the server is shut down! Doing so may result in an inactive or offline disk. Replace the disk after the OS, or at the very least the RAID controller option ROM, has loaded.
A disk becomes inactive when conflicting COD (configuration on disk) partitions are detected on multiple disks. Because the controller does not know which configuration is correct, it disables (makes inactive) one of them. This should only happen at boot time (if the new disk was added while the server was switched off).
If the new disk is added after the option ROM has loaded (or the OS has booted), the controller knows which disk is foreign (new) and automatically starts a rebuild onto it, if required.
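The rule above comes down to one question: was the controller running when the disk was inserted? This toy function restates the two outcomes. It is a simplification for illustration, not LSI controller logic, and the function name and parameters are hypothetical.

```python
def controller_outcome(inserted_before_oprom: bool, cod_conflict: bool,
                       rebuild_required: bool) -> str:
    """Toy model of how the RAID controller treats a replacement disk."""
    if inserted_before_oprom:
        # At boot the controller cannot tell which COD copy is current.
        if cod_conflict:
            return "one configuration made inactive (disk may show inactive/offline)"
        return "no conflict detected"
    # The running controller saw the insertion, so it knows the disk is foreign.
    if rebuild_required:
        return "foreign disk recognized; automatic rebuild starts"
    return "foreign disk recognized; no rebuild needed"
```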
LSI MegaRAID GUI Install on Windows
Unzip and go to the Disk1 directory.
Run Setup.
You will need to install the MS C++ components.
LSI MegaRAID Physical View
Start MegaRAID from the Start menu; you will get the initial screen.
LSI MegaRAID Physical View
LSI MegaRAID GUI Install on Linux
A little bit harder than the Windows install:
Unzip/untar the downloaded file.
cd disk
Run ./install.sh.
Press Y to accept the license agreement.
Select 3 to install Standalone. This installs Lib-Utils and Lib-Utils2.
You might see an error about snmpd; you can ignore it.
The GUI gets installed in /usr/local/megaraid Storage Manager.
startupui.sh runs the GUI, which works and looks just like the Windows GUI.
LSI MegaRAID Physical View - Linux
LSI MegaRAID GUI Install on VMware
Similar to the Linux install:
Unzip/untar the downloaded file.
cd disk
Run ./vmware_install.sh.
Press Y to accept the license agreement.
Select the ESX version (3.5 or 4.x).
Select N to use the inbox storage library.
This installs the server portion of the software. You will need to load the full MegaRAID software on another host and point it to the ESX server.
LSI MegaRAID VMware Remote Connect
Useful MegaRAID CLI Resources
LSI MegaRAID SAS Software User Guide: http://littleloubug.cisco.com/calif/mr_sas_sw_ug_80-00156-01_rev_j.pdf
HWRAID website: http://hwraid.le-vert.net/wiki/lsimegaraidsas
MegaRAID cheat sheet: http://tools.rapidsoft.de/perc/perc-cheat-sheet.html
Search for MegaRAID in your favorite search engine.