
Laohu cluster user manual
Li Changhua, National Astronomical Observatory, Chinese Academy of Sciences
2011/12/26

About the laohu cluster
The laohu cluster has 85 hosts; each host has 8 CPUs and 2 GPUs. The GPUs are NVIDIA Tesla C1060 cards and the installed CUDA library version is 4.0; please refer to http://laohu.bao.ac.cn for details. The cluster has one login node, laohu.bao.ac.cn, which users reach by ssh. After logging in you can upload and download files and compile code (the SSH user guide is in Appendix I). The cluster is behind a firewall that uses a whitelist, so please send me the public IP address of your machine before your first login. Special host group information: host_study (c1104 c0504 c0407 c1106 c1107) is scheduled by the cpu_test and gpu_test queues.

About LSF
Platform LSF (Load Sharing Facility) is a suite of distributed resource management products that:
- connects computers into a cluster (or grid);
- monitors the load of the systems;
- distributes, schedules and balances the workload;
- controls access and load by policies;
- analyzes the workload;
- provides a High Performance Computing (HPC) environment.
LSF schedules, controls and tracks jobs according to the configured policies, so any job or task can be run on the best node available. The main commands for job management are bqueues, bsub, bjobs, bkill, bpeek, etc.
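A typical job cycle with these commands might look like the following minimal sketch (the queue name cpu_small and the program name ./myprog are placeholders, not part of this manual):

bqueues                                 # see which queues exist and how busy they are
bsub -q cpu_small -n 4 ./myprog arg1    # submit ./myprog with argument arg1, asking for 4 job slots
bjobs                                   # check whether the job is pending or running, and on which hosts
bpeek <jobid>                           # look at the stdout/stderr of the running job
bkill <jobid>                           # terminate the job if something goes wrong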

Displays queues
Sample: bqueues

QUEUE_NAME   PRIO  STATUS       MAX  JL/U  JL/P  JL/H  NJOBS  PEND  RUN  SUSP
normal       30    Open:Active  -    8     -     -     22     2     20   0
long         30    Open:Active  -    304   -     -     52     12    40   0

Notes:
QUEUE_NAME: the name of the queue.
PRIO: the priority of the queue; the larger the value, the higher the priority.
STATUS: the current status of the queue. Open:Active means the queue accepts new jobs and jobs in the queue may be started; Closed:Active means the queue no longer accepts jobs, but jobs already in the queue may be started.
MAX: the maximum number of job slots that can be used by jobs from the queue; "-" means no limit.
JL/U: the maximum number of job slots each user can use for jobs in the queue.
NJOBS: the total number of job slots currently held by jobs in the queue.
PEND: the number of job slots used by pending jobs in the queue.
RUN: the number of job slots used by running jobs in the queue.
SUSP: the number of job slots used by suspended jobs in the queue.

Displays detailed queue information
Sample: bqueues -l
Notes:
CPULIMIT: the maximum CPU time a job can use, in minutes, relative to the CPU factor of the named host.
PROCLIMIT: the [minimum,] [default,] and maximum number of processors allocated to a job.
PROCESSLIMIT: the maximum number of concurrent processes allocated to a job.

Current queue configuration
cpu_32: between 150 and 192 job slots can be allocated to a job.
cpu_large: between 30 and 128 job slots can be allocated to a job.
cpu_small: between 1 and 48 job slots can be allocated to a job.
cpu_test: at most 12 job slots can be allocated to a job.
gpu_32: between 16 and 32 job slots can be allocated to a job.
gpu_16: between 8 and 16 job slots can be allocated to a job.
gpu_8: between 4 and 8 job slots can be allocated to a job.
gpu_test: between 1 and 10 job slots can be allocated to a job.
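A request must fit inside the slot limits of the queue it is submitted to. A minimal sketch (the executable name ./mycode is a placeholder):

bsub -q cpu_small -n 48 ./mycode    # fits: 48 slots is within cpu_small's range of 1 to 48
bsub -q cpu_large -n 64 ./mycode    # fits: 64 slots is within cpu_large's range of 30 to 128
bsub -q cpu_small -n 64 ./mycode    # does not fit: 64 is above cpu_small's maximum of 48, so choose another queue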

Submits a job
Command format: bsub [options] command [arguments]
options: bsub options such as -n, -q, -r, etc.
arguments: the arguments of your program.
After the job is submitted, you can run bjobs to find out which hosts are running it; you can then ssh to such a host and run top to see the CPU and memory usage there, as in the sketch below.
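A minimal sketch of that workflow (the queue name cpu_test, the host name c0407 and the program ./a.out are placeholders):

bsub -q cpu_test -n 4 ./a.out    # submit ./a.out to the cpu_test queue, asking for 4 slots
bjobs                            # the EXEC_HOST column shows where the job runs, e.g. c0407
ssh c0407                        # log in to the execution host
top                              # watch the CPU and memory usage of your processes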

Submits a job to a specified queue
The -q option selects the queue.
Sample 1: bsub -q serial executable1
This submits the job to the queue named serial. If the submission succeeds, bsub prints a confirmation such as "Job <79722> is submitted to queue <serial>"; 79722 is the job ID, which is unique.

Specifies the number of CPUs required
The -n option specifies the number of processors required to run the job (some of the processors may be on the same multiprocessor host). For GPU applications we suggest requesting 2 CPUs and 2 GPUs per host; for pure CPU applications with no GPU computing we suggest requesting 6 CPUs per host. To avoid several CPU processes using the same GPU, please do not hard-code a constant GPU device id in your program.
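A minimal sketch of such a request, assuming the gpu_8 queue from the list above and a placeholder program ./gpuprog (the span and rusage strings follow the -R examples later in this manual):

bsub -q gpu_8 -n 4 -R "span[ptile=2] rusage[ngpus=2]" ./gpuprog
# 4 slots in total, at most 2 processes per host, and 2 GPUs reserved on each host used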

Other useful options of bsub
Standard input, output and error files: -i, -o, -e
Sample 1: bsub -i executable1.input -o executable1-%J.log -e executable1-%J.err executable1
Exclusive execution mode: -x
In exclusive execution mode, your job runs by itself on a host. It is dispatched only to a host with no other jobs running, and LSF does not send any other jobs to the host until the job completes.
Resource requirements: -R
Sample 2: bsub -R "select[type==any] span[ptile=1] rusage[ngpus=2]" executable1
This runs the job on hosts that meet the specified resource requirements.
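These options can be combined in one submission. A sketch with placeholder names (mycode and input.par are not from this manual):

bsub -q cpu_small -n 6 -x -o run-%J.log -e run-%J.err ./mycode input.par
# exclusive hosts, stdout written to run-<jobid>.log and stderr to run-<jobid>.err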

Submits a job by script
Sample 1: cpujob.lsf

#!/bin/sh
#BSUB -q naoc_c                              #job queue, modify according to user
#BSUB -a openmpi-qlc
#BSUB -R "select[type==any] span[ptile=6]"   #resource requirement of host
#BSUB -o out.test                            #output file
#BSUB -n 256                                 #the maximum number of CPUs
mpirun.lsf --mca "btl openib,self" Gadget2wy WJL.PARAM   #modify for the user's program

Exec method: bsub < cpujob.lsf

Submits a job by script
Sample 2: gpujob.lsf

#!/bin/sh
#BSUB -q naoc_g                                              #job queue
#BSUB -a openmpi-qlc
#BSUB -R "select[type==any] span[ptile=2] rusage[ngpus=2]"   #resource requirement of host
#BSUB -o out.test                                            #output file
#BSUB -n 20                                                  #the maximum number of CPUs
mpirun.lsf --prefix "/usr/mpi/gcc/openmpi-1.3.2-qlc" -x "LD_LIBRARY_PATH=/export/cuda/lib:/usr/mpi/gcc/openmpi-1.3.2-qlc/lib64" ./phi-grape.exe   #modify for the user's program

Exec method: bsub < gpujob.lsf

Displays job status
Sample 1: bjobs
For example, job 79726 has 2 processes on host node31 and 1 process on host node4, while job 79727 is pending.
Sample 2: bjobs -l 79727
The -l option displays detailed information about job 79727.
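Some other common forms of bjobs (these are standard LSF options, not specific to this manual):

bjobs -u all       # show the jobs of all users, not only your own
bjobs -w           # wide format, so job and host names are not truncated
bjobs -l 79727     # full details of job 79727, including submit time and resource requirements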

Terminates a job
Sample: bkill 79722
Job <79722> is being terminated.
Please terminate jobs that have failed or that you no longer intend to run, so as to free host resources.

Suspends and resumes a job
Sample: bstop 79727
Notes:
1. Suspend unfinished jobs so that other pending jobs can start to run.
2. A running job keeps the resources it is using even while suspended, so I suggest that you do not suspend running jobs.
Sample: bresume 79727
Resumes the suspended job 79727.

Modifies a pending job
Sample 1: btop 79727
Moves job 79727 to the top of the queue.
Sample 2: bbot 79727
Moves job 79727 to the bottom of the queue.
Sample 3: bmod -Z executable2 -q long -n 12 79727
Changes the executable program, queue and CPU number of job 79727.

Displays the output of a job
Sample 1: bpeek 79727
Displays the stdout and stderr output of job 79727.
Sample 2: bpeek -f 79727
Displays the stdout and stderr output of job 79727 using the command tail -f.
If you used the -o and -e options of bsub, you can also open the specified files directly to read the output.

Displays host status
Sample 1: bhosts
Sample 2: lsload

Displays a user's information
Sample 1: busers lich
Notes:
1. MAX: the maximum number of job slots that the jobs of user lich can use concurrently.
2. NJOBS: the current number of job slots used by the jobs of user lich.
3. PEND: the number of job slots used by pending jobs of user lich.
4. RUN: the number of job slots used by running jobs of user lich.
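To see the same counters for every user at once, busers also accepts the keyword all:

busers all         # job-slot usage of all users in the cluster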

Appendix I: SSH introduction
Under Linux:
ssh user@laohu.bao.ac.cn
When the password prompt appears, enter your password. You can use the scp command to copy files between your local machine and the cluster login host:
scp localfile user@laohu.bao.ac.cn:~/
Under Windows:
First install an SSH client such as SecureShellClient, which is simple to use. For the download address and user guide, please refer to the separate document secureshellclient.pdf.
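Copying results back works the same way in the other direction (a sketch; the remote directory ~/results is a placeholder):

scp -r user@laohu.bao.ac.cn:~/results ./    # copy the whole results directory to the current local directory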