IBM Platform HPC V3.2: GPU Management with NVIDIA CUDA 5

Gábor Samu, Technical Product Manager, IBM Systems and Technology Group
Mehdi Bozzo-Rey, HPC Solutions Architect, IBM Systems and Technology Group

Issued: February 11, 2013; Revised: August 2013
Contents

Executive Summary ... 3
Introduction ... 4
Environment Preparation ... 5
Provision nodes equipped with NVIDIA Tesla ... 9
Monitor nodes equipped with NVIDIA Tesla ... 10
Best practices ... 11
Conclusion ... 12
Further reading ... 12
Notices ... 13
Trademarks ... 14
Contacting IBM ... 14
Executive Summary

IBM Platform HPC Version 3.2 (Platform HPC) is easy-to-use, yet comprehensive technical computing management software. It includes GPU scheduling, management, and monitoring capabilities as standard for systems equipped with NVIDIA Tesla GPUs. Platform HPC 3.2 supports NVIDIA CUDA 4.1, including a CUDA 4.1 Kit which simplifies deployment of the software in the clustered environment. Later versions of NVIDIA Tesla based upon the NVIDIA Kepler architecture, however, require CUDA 5 to operate. This document provides the steps to install and configure a Platform HPC 3.2 cluster with NVIDIA Tesla Kepler hardware.
Introduction

This document serves as a guide to enabling the Platform HPC GPU management capabilities with NVIDIA CUDA 5. The steps below assume familiarity with Platform HPC commands and concepts. The procedure relies on the following capabilities of Platform HPC to deploy NVIDIA CUDA 5:

Cluster File Manager (CFM): used to automate patching of the system boot files to perform the installation of NVIDIA CUDA 5.

Post-install script: used to trigger the execution of the patched system startup file on the first boot after provisioning.

Note that the procedure provided in this document is generic and may be used to deploy other software in a cluster managed by Platform HPC.
Environment Preparation

The following steps assume that the Platform HPC V3.2 head node has been installed and that compute nodes equipped with NVIDIA Tesla GPUs are available to be added (provisioned) to the cluster. The specifications of the example environment follow:

IBM Platform HPC V3.2 (Red Hat Enterprise Linux 6.2 x64)
NVIDIA Tesla K20c
NVIDIA CUDA 5 (cuda_5.0.35_linux_64_rhel6.x-1.run)
Two node cluster
o installer000 (cluster head node)
o compute000 (compute node, equipped with NVIDIA Tesla K20c)

The following steps enable provisioning of compute nodes equipped with NVIDIA Tesla:

1. The cluster administrator must download NVIDIA CUDA 5 and copy it to the /shared directory on the Platform HPC head node. This directory must be NFS-mounted by all compute nodes managed by Platform HPC. Note that the execute bit must be set on the CUDA package file.

# cp ./cuda_5.0.35_linux_64_rhel6.x-1.run /shared
# chmod 755 /shared/cuda_5.0.35_linux_64_rhel6.x-1.run
# ls -la /shared/cuda*
-rwxr-xr-x 1 root root 702136770 Apr 4 20:59 /shared/cuda_5.0.35_linux_64_rhel6.x-1.run

2. On the Platform HPC head node, create a new node group for nodes equipped with NVIDIA Tesla hardware. Give the new node group template the name compute-rhel-6.2-x86_64_tesla. It is a copy of the built-in node group template compute-rhel-6.2-x86_64.

# kusu-ngedit -c compute-rhel-6.2-x86_64 -n compute-rhel-6.2-x86_64_tesla
Running plugin: /opt/kusu/lib/plugins/cfmsync/getent-data.sh
..
New file found: /etc/cfm/compute-rhel-6.2-x86_64_tesla/root/.ssh/authorized_keys
New file found: /etc/cfm/compute-rhel-6.2-x86_64_tesla/root/.ssh/id_rsa
New file found: /etc/cfm/compute-rhel-6.2-x86_64_tesla/opt/kusu/etc/logserver.addr
New file found: /etc/cfm/compute-rhel-6.2-x86_64_tesla/opt/lsf/conf/hosts
New file found: /etc/cfm/compute-rhel-6.2-x86_64_tesla/opt/lsf/conf/profile.lsf
New file found: /etc/cfm/compute-rhel-6.2-x86_64_tesla/etc/group.merge
New file found: /etc/cfm/compute-rhel-6.2-x86_64_tesla/etc/hosts.equiv
New file found: /etc/cfm/compute-rhel-6.2-x86_64_tesla/etc/hosts
New file found: /etc/cfm/compute-rhel-6.2-x86_64_tesla/etc/shadow.merge
New file found: /etc/cfm/compute-rhel-6.2-x86_64_tesla/etc/.updatenics
New file found: /etc/cfm/compute-rhel-6.2-x86_64_tesla/etc/passwd.merge
New file found: /etc/cfm/compute-rhel-6.2-x86_64_tesla/etc/fstab.kusuappend
New file found: /etc/cfm/compute-rhel-6.2-x86_64_tesla/etc/ssh/ssh_config
..
Distributing 76 KBytes to all nodes.

3. Configure the CFM framework to patch the /etc/rc.local file on the set of compute nodes. The following example script checks for the existence of the NVIDIA CUDA tool nvidia-smi in /usr/bin on a node. If nvidia-smi is not found in /usr/bin, the script mounts the NFS share /depot/shared at /shared and runs the NVIDIA CUDA installation with the option for silent (non-interactive) installation.

Note: You must modify this script according to the Platform HPC environment and the NVIDIA CUDA 5 package filename (cuda_5.0.35_linux_64_rhel6.x-1.run).

Save the following script as rc.local.append in the /etc/cfm/compute-rhel-6.2-x86_64_tesla/etc directory on the Platform HPC head node.

# Copyright International Business Machine Corporation, 2013
# This information contains sample application programs in source language, which
# illustrates programming techniques on various operating platforms. You may copy,
# modify, and distribute these sample programs in any form without payment to IBM,
# for the purposes of developing, using, marketing or distributing application
# programs conforming to the application programming interface for the operating
# platform for which the sample programs are written. These examples have not been
# thoroughly tested under all conditions. IBM, therefore, cannot guarantee or
# imply reliability, serviceability, or function of these programs. The sample
# programs are provided "AS IS", without warranty of any kind. IBM shall not be
# liable for any damages arising out of your use of the sample programs.
# Each copy or any portion of these sample programs or any derivative work, must
# include a copyright notice as follows:
# (C) Copyright IBM Corp. 2013.
# The following example script portion is used to install NVIDIA CUDA 5
# on the specified compute nodes. This is done by patching the
# /etc/rc.local file and running the NVIDIA CUDA 5 installer in silent mode.
# The installation will not be performed if NVIDIA CUDA is found installed
# on the system already (test if /usr/bin/nvidia-smi exists).
# Pre-requisites:
# 1. NVIDIA CUDA 5 package exists in /shared and has permissions root/755.
# 2. /depot/shared on the IBM Platform HPC head node is mounted at /shared
#    (adjust the IP address here according to your environment).
if [ ! -f /usr/bin/nvidia-smi ]
then
    mkdir -p /shared
    mount -t nfs 192.0.2.150:/depot/shared /shared
    /shared/cuda_5.0.35_linux_64_rhel6.x-1.run -driver -toolkit -silent
fi
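The guard at the top of this script is what makes the patched rc.local safe to run on every boot: the installer fires only when nvidia-smi is absent. The idempotent check can be exercised in isolation; in the following sketch the helper name needs_cuda_install and the throwaway directory are illustrative, not part of Platform HPC or CUDA.

```shell
# Illustrative helper mirroring the rc.local.append guard: install only
# when nvidia-smi is not yet present under the given prefix.
needs_cuda_install() {
    [ ! -f "$1/nvidia-smi" ]
}

# Exercise the guard against a throwaway directory instead of /usr/bin.
tmp=$(mktemp -d)
needs_cuda_install "$tmp" && echo "install needed"
touch "$tmp/nvidia-smi"            # simulate a completed CUDA install
needs_cuda_install "$tmp" || echo "already installed"
rm -rf "$tmp"
```

Because rc.local runs on every boot, a guard of this shape is what prevents the 700 MB installer from re-running after the first successful installation.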
4. Create a post-installation script which will be configured to execute on the set of compute nodes. The post-installation script forces the execution of the updated /etc/rc.local script during the initial boot of a node after provisioning. Save the following script as /root/run_rc_local.sh on the Platform HPC head node. Note that this script will be specified as a post-installation script in subsequent steps.

#!/bin/sh -x
# Copyright International Business Machine Corporation, 2013
# This information contains sample application programs in source language, which
# illustrates programming techniques on various operating platforms. You may copy,
# modify, and distribute these sample programs in any form without payment to IBM,
# for the purposes of developing, using, marketing or distributing application
# programs conforming to the application programming interface for the operating
# platform for which the sample programs are written. These examples have not been
# thoroughly tested under all conditions. IBM, therefore, cannot guarantee or
# imply reliability, serviceability, or function of these programs. The sample
# programs are provided "AS IS", without warranty of any kind. IBM shall not be
# liable for any damages arising out of your use of the sample programs.
# Each copy or any portion of these sample programs or any derivative work, must
# include a copyright notice as follows:
# (C) Copyright IBM Corp. 2013.
# The following script will force the execution of /etc/rc.local after it is
# updated via CFM on the initial boot of a node after provisioning.
/etc/rc.local > /tmp/runrc.log 2>&1

5. On the Platform HPC head node, start kusu-ngedit and edit the node group named installer-rhel-6.2-x86_64. The following updates are required to enable monitoring of GPU devices in the Platform HPC Web Console.

On the Components screen, enable component-platform-lsf-gpu under platform-lsf-gpu.

Select Yes to synchronize changes.

6. On the Platform HPC head node, start kusu-ngedit and edit the node group named compute-rhel-6.2-x86_64_tesla. The following updates are required to enable the GPU monitoring agents on the nodes, in addition to the required OS software packages and kernel parameters for NVIDIA GPUs.

On the Boot Time Parameters screen, add the following Kernel Parameters at the end of the line: rdblacklist=nouveau nouveau.modeset=0

On the Components screen, enable component-platform-lsf-gpu under platform-lsf-gpu.

On the Optional Packages screen, enable the following packages:
kernel-devel
gcc
gcc-c++

On the Custom Scripts screen, add the script /root/run_rc_local.sh.

Select Yes to synchronize changes.

7. Update the configuration of the Platform HPC workload manager. This is required so that the NVIDIA CUDA-specific metrics are taken into account.

# kusu-addhost -u
Running plugin: /opt/kusu/lib/plugins/cfmsync/getent-data.sh
Updating installer(s)
Setting up dhcpd service...
Setting up dhcpd service successfully...
Setting up NFS export service...
Running plugin: /opt/kusu/lib/plugins/cfmsync/getent-data.sh
Distributing 60 KBytes to all nodes.
Updating installer(s)
Provision nodes equipped with NVIDIA Tesla

After completing the environment pre-requisites, complete the following steps to provision the compute nodes equipped with NVIDIA Tesla. You can provision nodes using the Platform HPC Web Console, or with the kusu-addhost command. The following steps provision the node using the kusu-addhost command with the newly created node group template compute-rhel-6.2-x86_64_tesla.

Note: Once nodes are discovered by kusu-addhost, the administrator must exit from the listening mode by pressing Control-C. This completes the node discovery process.

# kusu-addhost -i eth0 -n compute-rhel-6.2-x86_64_tesla -b
Scanning syslog for PXE requests...
Discovered Node: compute000
Mac Address: 00:1e:67:31:45:58
^C
Command aborted by user...
Setting up dhcpd service...
Setting up dhcpd service successfully...
Setting up NFS export service...
Running plugin: /opt/kusu/lib/plugins/cfmsync/getent-data.sh
Distributing 84 KBytes to all nodes.
Updating installer(s)
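Once the node has booted and the rc.local.append logic has run, a quick spot check that the CUDA driver actually came up is useful before relying on the Web Console. A minimal sketch follows; the host name, the nvidia-smi query flags (which require a sufficiently recent driver), and the sample output line are illustrative assumptions, not captured from the environment in this document.

```shell
# Spot-check a freshly provisioned node (illustrative command):
#   ssh compute000 nvidia-smi --query-gpu=name,temperature.gpu --format=csv,noheader
#
# One line of such CSV output can be picked apart with plain POSIX
# parameter expansion, no awk required:
sample="Tesla K20c, 36"        # illustrative sample output line
name=${sample%%,*}             # everything before the first comma
temp=${sample##*, }            # everything after the last ", "
echo "$name reports ${temp} C"
```

If the driver failed to install, nvidia-smi will be missing from /usr/bin on the node; /tmp/runrc.log (written by run_rc_local.sh) is the first place to look for the installer's output.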
Monitor nodes equipped with NVIDIA Tesla

After provisioning all of your GPU-equipped nodes, you can monitor GPU-related metrics through the Platform HPC Web Console. Navigate to the following URL with any supported Web browser, and log in as a user with administrative privileges:

http://<ibm_platform_hpc_head_node>

The Platform HPC Web Console provides the following views of GPU metrics:

Dashboard view
Host List view (GPUs tab)

Dashboard view

In the Dashboard view, hover the mouse pointer over a node equipped with NVIDIA Tesla. The popup displays the GPU temperature and any ECC errors.
Host List view (GPUs tab)

In the Host List view, select a node equipped with NVIDIA Tesla and select the GPUs tab displayed in the bottom portion of the interface. This displays the temperature (Celsius) and any ECC errors for each GPU detected.

Best practices

IBM Platform HPC V3.2 provides a generic framework for synchronizing files and packages, and for executing custom scripts after provisioning. These capabilities may be used to install software which is not packaged as an IBM Platform HPC Kit; in this example, the installation of NVIDIA CUDA 5 is automated. IBM Platform HPC V3.2 GPU monitoring functions as expected with NVIDIA CUDA 5.
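The CUDA procedure above generalizes to any software package: a guarded rc.local.append installed via CFM, plus a post-install script to fire it on first boot. The sketch below generates such a guarded fragment for an arbitrary package; the function name make_append and the marker/installer paths are illustrative, not Platform HPC commands.

```shell
# Generic sketch of the pattern used above: emit an idempotent
# rc.local.append fragment that runs an installer only while a
# marker file (typically a binary the installer creates) is absent.
make_append() {
    marker="$1"
    installer="$2"
    cat <<EOF
if [ ! -f $marker ]
then
    $installer
fi
EOF
}

# Example: generate a fragment for a hypothetical package.
make_append /usr/bin/mytool /shared/install_mytool.sh
```

Redirect the output into the node group's /etc/cfm/<node_group>/etc/rc.local.append, and CFM distributes it exactly as in the CUDA example.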
Conclusion

IBM Platform HPC V3.2 is easy-to-use, yet comprehensive technical computing management software. It includes GPU scheduling, management, and monitoring capabilities as standard for systems equipped with NVIDIA Tesla GPUs. This document has described the steps to install and configure an IBM Platform HPC V3.2 cluster with NVIDIA Tesla Kepler hardware, guiding you through enabling the IBM Platform HPC V3.2 GPU management capabilities with NVIDIA CUDA 5. You have learned how to prepare your Platform HPC environment, and how to provision and monitor nodes equipped with NVIDIA Tesla. You have also learned some best practices for automating the installation, patching, and synchronization of files and packages using Platform HPC. The procedures provided in this document are generic and may be used to deploy other software in clusters managed by IBM Platform HPC.

Further reading

IBM Platform HPC Version 3.2
http://www-03.ibm.com/systems/technicalcomputing/platformcomputing/index.html

NVIDIA Developer Zone
https://developer.nvidia.com/category/zone/cuda-zone
Notices

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

Without limiting the above disclaimers, IBM provides no representations or warranties regarding the accuracy, reliability or serviceability of any information or recommendations provided in this publication, or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein.
The information contained in this document has not been submitted to any formal IBM test and is distributed AS IS. The use of this information or the implementation of any recommendations or techniques herein is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Anyone attempting to adapt these techniques to their own environment does so at their own risk.

This document and the information contained herein may be used solely in connection with the IBM products discussed in this document.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary.
Users of this document should verify the applicable data for their specific environment.
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

Copyright IBM Corporation 2013. All Rights Reserved.

This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S.
registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml

Windows is a trademark of Microsoft Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.

Contacting IBM

To contact IBM in your country or region, check the IBM Directory of Worldwide Contacts at http://www.ibm.com/planetwide

To learn more about IBM Platform Computing HPC, go to http://www-03.ibm.com/systems/technicalcomputing/platformcomputing/products/hpc/