
IBM Platform LSF Best Practices

IBM Platform LSF 9.1.3 and IBM GPFS in Large Clusters

Jin Ma
Platform LSF Developer
IBM Canada

Table of Contents

Executive Summary
Introduction
Minimizing the impact of network connection and node failures
Minimizing interference among LSF, GPFS, and applications running on GPFS
    Deploy dedicated Network Shared Disk (NSD) servers and storage for LSF
    Pre-configure a large inode pool
    Install GPFS V3.5.0.19 or above
    Configure LSB_JOB_TMPDIR and LSF_TMPDIR
    Configure JOB_SPOOL_DIR
    Specify the LSF job current working directory and output directory
Best Practices
Conclusion
Further reading
Contributors
Notices
Trademarks
Contacting IBM

Executive Summary

This Best Practice Guide provides suggestions and solutions for smooth deployment of IBM Platform LSF and IBM General Parallel File System (GPFS) in a large cluster, to achieve high job throughput and cluster reliability, performance, and scalability.

Introduction

Experience from past deployments of Platform LSF and IBM GPFS together on large clusters shows that additional fine-tuning of the two systems is required to minimize interference and to achieve high job throughput as well as optimal cluster stability, performance, and scalability. Possible issues fall into the following groups:

1. LSF and GPFS both depend on network stability and robustness of recovery. If LSF is installed on GPFS, this dependency is amplified: if GPFS fails, LSF also fails. Common problems include node expel issues, and LSF daemon crashes caused by GPFS unmounts when LSF is installed on GPFS.

2. LSF and GPFS are both affected by heavy application load (CPU, memory, network I/O), which can result in node expel issues.

3. GPFS performance: automatic inode pool expansion can cause GPFS to hang for several minutes.

4. GPFS can hang because of how LSF accesses the file system. If LSF is installed on GPFS or needs to access GPFS, LSF may stop responding or crash. The GPFS daemon requires at least 100 MB of free space under /tmp; if /tmp on many nodes is mounted on a shared GPFS directory, GPFS can crash due to lack of /tmp space. GPFS can also hang when tokens are revoked because multiple clients create and remove sub-directories, or frequently read and write files, under the same top-level directory. For example, if /tmp is mounted on a shared GPFS directory on a diskless node, LSF creates a job temporary directory (/tmp/.<jobid>) when each job is dispatched and removes it after the job finishes, which can cause GPFS to hang.

This Best Practice Guide provides solutions aimed at avoiding or minimizing these issues.

Note: IBM Platform LSF 9.1.3 patch #247132 or later is required for these recommendations.

Minimizing the impact of network connection and node failures

Both GPFS and LSF rely on a healthy network and healthy nodes to function properly. Network and node failures can be catastrophic to workload efficiency and data integrity, so maintaining a healthy network and healthy nodes is key to smooth cluster operations. A single node failure or connection issue triggers GPFS to expel the node from the cluster in order to recover and protect data integrity. Node expel operations affect not only the problematic nodes; they can also ripple out to otherwise healthy nodes. If LSF is installed on GPFS, the LSF daemon processes crash when GPFS is unmounted. It is better to have LSF use a completely different network from the one used by GPFS and the applications. Before installing, testing, and starting LSF and GPFS, run a network and node health check; a minimal sketch is shown below. IBM has published best practices for node and network health checking tools and procedures. See the references in Further reading for more information.
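The exact checks depend on the site, but a minimal pre-flight sketch might look like the following. It uses standard GPFS and LSF administration commands; the hosts.txt node list is a hypothetical file name used only for illustration.

    # 1. GPFS daemon state and file system mounts on every node
    mmgetstate -a                # every node should report "active"
    mmlsmount all -L             # confirm each file system is mounted where expected

    # 2. Basic network reachability between nodes
    while read host; do
        ping -c 1 -W 2 "$host" > /dev/null || echo "WARNING: $host unreachable"
    done < hosts.txt

    # 3. LSF configuration sanity checks
    lsadmin ckconfig -v          # validates lsf.conf and lsf.cluster.<cluster_name>
    badmin ckconfig -v           # validates the lsb.* configuration files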

Minimizing interference among LSF, GPFS, and applications running on GPFS

Deploy dedicated Network Shared Disk (NSD) servers and storage for LSF

The main idea is to isolate LSF from GPFS and the applications running on it. Ideally, the file system used by LSF and the GPFS file systems used by applications should be completely separate. If this is difficult to implement and LSF must be installed on GPFS, the next best option is to assign dedicated NSD servers and storage for LSF. User applications should not access the NSD servers and storage dedicated to LSF.

Figure: The recommended configuration, with dedicated NSD servers and storage for LSF.

Pre-configure a large inode pool

When GPFS automatically expands the inode pool, you might see a delay in response. If LSF is installed on GPFS, LSF can be affected by such a delay and temporarily stop responding to requests. It is highly recommended to pre-configure a large inode pool in GPFS so that there are always enough inode entries available and automatic expansion is avoided.

Install GPFS V3.5.0.19 or above

As mentioned above, GPFS V3.5.0.19 (with the fix for APAR IV61627, defect 927501) or higher is recommended. This fix contains enhancements for handling node expel issues, which reduces the impact on LSF.
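As a concrete sketch of the two recommendations above, the following commands check the installed GPFS level and pre-allocate a large inode pool. The file system name gpfslsf and the inode counts are hypothetical and should be sized to the expected number of files; the commands are standard GPFS administration commands.

    rpm -q gpfs.base                               # on Linux, shows the installed GPFS package level
    mmdf gpfslsf -F                                # show current and maximum inode usage for the file system
    mmchfs gpfslsf --inode-limit 8000000:8000000   # raise the inode limit and pre-allocate the inodes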

Configure LSB_JOB_TMPDIR and LSF_TMPDIR

The GPFS daemon requires 100 MB of free space under /tmp on each node. To prevent LSF jobs from competing with GPFS for /tmp space, set LSB_JOB_TMPDIR in lsf.conf to a per-host directory on GPFS. LSF creates a job-specific temporary directory (<jobid>.tmpdir) for each job and removes this directory after the job finishes. Frequent directory creation and deletion under the same directory by multiple GPFS clients can cause GPFS to revoke tokens to protect data integrity, which can significantly affect LSF performance and job throughput. Set the following parameters in lsf.conf:

LSB_JOB_TMPDIR=/gpfs/shared_dir/%H
LSF_TMPDIR=/tmp      # or not defined

You should create LSB_JOB_TMPDIR, as well as a sub-directory for each LSF server host name, with the right permissions before running jobs in LSF. If LSB_JOB_TMPDIR is defined on a shared file system, all users must be able to write to this directory from all hosts. If the LSB_JOB_TMPDIR sub-directories are not created ahead of time, LSF creates them. If you change LSB_JOB_TMPDIR, run badmin hrestart all to make the new setting take effect.
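A minimal sketch for pre-creating the per-host sub-directories, assuming the LSB_JOB_TMPDIR value shown above (the /gpfs/shared_dir path is hypothetical) and that bhosts lists all LSF server hosts:

    TMPTOP=/gpfs/shared_dir
    mkdir -p "$TMPTOP"
    for host in $(bhosts -w | awk 'NR>1 {print $1}'); do
        mkdir -p "$TMPTOP/$host"
        chmod 777 "$TMPTOP/$host"    # every user must be able to write from every host
    done
    badmin hrestart all              # pick up the lsf.conf change on all hosts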

Configure JOB_SPOOL_DIR

In lsb.params, set JOB_SPOOL_DIR to the directory where you want to store the job files, as well as the stderr and stdout files for each job, after the job is dispatched to a host. These LSF temporary files are cleaned up after the job finishes. The default location is the user's home directory ($HOME/.lsbatch). As with the job-specific temporary directory, multiple clients reading and writing files under the same directory trigger token revocation, which can cause delays in accessing GPFS. Job setup becomes sequential, with a direct impact on LSF job throughput and longer turnaround times. If JOB_SPOOL_DIR is set on GPFS, or the user home directories are on GPFS, you must add %H to the path so that JOB_SPOOL_DIR is located under a sub-directory named after the first execution host of the job:

JOB_SPOOL_DIR=%H
or
JOB_SPOOL_DIR=/gpfs/shared/%H

If you specify JOB_SPOOL_DIR=%H, the actual job spool directory is created under $HOME/.lsbatch/%H, and LSF creates the %H subdirectory with permission 700. If you specify JOB_SPOOL_DIR=/gpfs/shared/%H, LSF creates the %H subdirectory with permission 777. You can use other patterns in JOB_SPOOL_DIR:

%H     first execution host
%U     job execution user name
%C     execution cluster name
%JG    job group
%P     job project
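For reference, a minimal lsb.params fragment under the assumptions above (the /gpfs/shared path is hypothetical):

    Begin Parameters
    JOB_SPOOL_DIR = /gpfs/shared/%H
    End Parameters

After editing lsb.params, run badmin reconfig so that mbatchd picks up the new setting.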

Put %H and %U before all other variable patterns in directory paths

It is also highly recommended that these directories be created ahead of time, so that LSF does not need to create them. Make sure the permissions are set properly:

- If JOB_SPOOL_DIR is under $HOME, it should have permission 700.
- If JOB_SPOOL_DIR is under a shared directory, it should have permission 777. One exception: if JOB_SPOOL_DIR=/shared_top/%U/, where %U is the first pattern in JOB_SPOOL_DIR, it should have permission 700.

If JOB_SPOOL_DIR is pre-created, LSF does not change its permissions; the LSF administrator must create JOB_SPOOL_DIR with the proper permissions.

Specify the LSF job current working directory and output directory

You can specify the LSF job current working directory and output directory in the following ways:

- bsub -cwd
- the LSB_JOB_CWD environment variable
- JOB_CWD in an application profile
- DEFAULT_JOB_CWD in lsb.params
- bsub -outdir
- DEFAULT_JOB_OUTDIR in lsb.params

The job runs under the specified working directory and generates files in the specified output directory. If many jobs create and delete subdirectories under the same working directory, GPFS token revocation is triggered and GPFS response slows down. This can impact overall job throughput, as well as LSF performance if LSF is installed and run on GPFS. LSF supports the %H (host-based) pattern in the job working directory and output directory; %H is replaced with the host name of the first execution host of the job.

In addition to %H, you can also use the following patterns in the job current working directory and output directory:

%J     job ID
%JG    job group
%I     index
%EJ    execution job ID
%EI    execution index
%P     project name
%U     user name
%H     first execution host
%G     user group

Put %U and %H before any other subdirectories

In general, you should put %U and %H before any other subdirectories in the job current working directory (cwd) and output directory (outdir) specifications, as in the sketch below. If the specified cwd or outdir is shared by multiple users, create the subdirectories before the cluster is started. See the topic "Using Platform LSF job directories" on the LSF product family wiki on IBM developerWorks.
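As an illustrative sketch (the /gpfs paths and the application name my_app are hypothetical), a submission that keeps per-user, per-host subdirectories at the front of both paths, or the equivalent cluster-wide defaults in lsb.params, might look like:

    bsub -cwd "/gpfs/work/%U/%H" -outdir "/gpfs/output/%U/%H" ./my_app

    # Or as cluster-wide defaults in lsb.params:
    # DEFAULT_JOB_CWD = /gpfs/work/%U/%H
    # DEFAULT_JOB_OUTDIR = /gpfs/output/%U/%H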

Best Practices

The key points of these best practices are to maintain healthy network connections and to isolate LSF from the applications on GPFS, in terms of both storage and servers, to minimize interference:

- Complete node health checks and network connection verification.
- Prepare and configure dedicated NSD servers and storage for LSF use only; user applications use separate servers and storage.
- Always keep the LSF and GPFS binaries up to date.
- Configure GPFS and LSF to use separate NSD servers and storage.
- Configure a large inode pool in GPFS.
- Configure LSB_JOB_TMPDIR, JOB_SPOOL_DIR, the job current working directory, and the job output directory with the %H (host-based) pattern so that LSF daemons on different hosts can concurrently create or delete sub-directories and read and write files on GPFS without triggering token revocation.
- Pre-create the per-host directories with appropriate permissions on GPFS before you start LSF.

Conclusion

The purpose of this best practice guide is to make LSF and GPFS deployment smooth and simple by minimizing the impact of OS, network, and hardware issues, as well as the interference between the two systems. By completing the cluster health check and configuring LSF and GPFS in the recommended way, LSF job throughput should improve, and cluster performance, scalability, and usability are also enhanced.

Further reading

IBM Cluster Health Check:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Welcome%20to%20High%20Performance%20Computing%20%28HPC%29%20Central/page/ibm%20cluster%20health%20check

IBM GPFS Knowledge Center:
http://www-01.ibm.com/support/knowledgecenter/ssfkcn/gpfs_welcome.html

Contributors

Enci Zhong
Chong Chen
Demin Zhang
Jiuwei Jiang

Notices

This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions; therefore, this statement may not apply to you.

Without limiting the above disclaimers, IBM provides no representations or warranties regarding the accuracy, reliability or serviceability of any information or recommendations provided in this publication, or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information contained in this document has not been submitted to any formal IBM test and is distributed AS IS. The use of this information or the implementation of any recommendations or techniques herein is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Anyone attempting to adapt these techniques to their own environment does so at their own risk.

This document and the information contained herein may be used solely in connection with the IBM products discussed in this document.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Any performance data contained herein was determined in a controlled environment.
Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

© Copyright IBM Corporation 2014. All Rights Reserved.

This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml

Windows is a trademark of Microsoft Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.

Contacting IBM

To provide feedback about this paper, write to jma@ca.ibm.com

To contact IBM in your country or region, check the IBM Directory of Worldwide Contacts at http://www.ibm.com/planetwide

To learn more about IBM Platform products, go to http://www.ibm.com/systems/platformcomputing/