IBM PSSC Montpellier Customer Center. Content: MPIRUN Command, Environment Variables, LoadLeveler, SUBMIT Command, IBM Simple Scheduler


Content
- MPIRUN Command
- Environment Variables
- LoadLeveler
- SUBMIT Command
- IBM Simple Scheduler

Control System

Service Node (SN)
- An IBM System p 64-bit system
- The Control System and the database are on this system
- Access to this system is generally privileged
- Communication with Blue Gene via a private 1 Gb control Ethernet

Database
- A commercial database tracks the state of the system: hardware inventory, partition configuration, RAS data, environmental data, and operational data including partition state, jobs, and job history
- Service action support for hot-plug hardware

Administration and System Status
- Administration either via a console or the web Navigator interface

Service Node Database Structure
DB2 hosts four databases: Configuration, Operational, Environmental, and RAS.
- The Configuration database is the representation of all the hardware on the system
- The Operational database contains information and status for things that do not correspond directly to a single piece of hardware, such as jobs, partitions, and history
- The Environmental database keeps current values for all hardware components on the system, such as fan speeds, temperatures, and voltages
- The RAS database collects hard errors, soft errors, machine checks, and software problems detected from the compute complex
Useful log files: /bgsys/logs/bgp

Job Launching Mechanism

mpirun command
- Standard mpirun options supported
- May be used to launch any job, not just MPI-based applications
- Has options to allocate partitions when a scheduler is not in use

Scheduler APIs enable various schedulers:
- LoadLeveler
- SLURM
- Platform LSF
- Altair PBS Pro
- Cobalt
Note: all of these schedulers rely on mpirun/mpiexec to launch jobs.

MPIRUN Implementation
Identical functionality to the BG/L implementation, plus a new implementation and new options:
- No more rsh/ssh mechanism, for security reasons; replaced by a daemon running on the Service Node
- The freepartition command is integrated as an option (-free)
- Standard input (STDIN) is supported on BG/P (only MPI task 0)
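
For example, a previously booted partition can be released directly from mpirun. A minimal sketch, assuming a hypothetical block name MYBLOCK (the exact invocation may vary by driver level):

mpirun -partition MYBLOCK -free   # frees the block instead of launching a job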

MPIRUN Command Parameters (1)
-args "program args" : Pass "program args" to the Blue Gene job on the compute nodes
-cwd <Working Directory> : Specifies the full path to use as the current working directory on the compute nodes. The path is specified as seen by the I/O and compute nodes
-exe <Executable> : Specifies the full path to the executable to run on the compute nodes. The path is specified as seen by the I/O and compute nodes
-mode { SMP | DUAL | VN } : Specifies the mode the job will run in: SMP, dual, or virtual node mode
-np <Nb MPI Tasks> : Create exactly n MPI ranks for the job. Aliases are -nodes and -n

MPIRUN Command Parameters (2)
-enable_tty_reporting : By default mpirun tells the control system and the C runtime on the compute nodes that STDIN, STDOUT, and STDERR are tied to TTY-type devices; this option reports their real status instead, enabling STDOUT buffering (GPFS block size) when the streams are redirected
-env "<Variable Name>=<Variable Value>" : Set an environment variable in the environment of the job on the compute nodes
-expenv <Variable Name> : Export an environment variable from mpirun's current environment to the job on the compute nodes
-label : Have mpirun label the source of each line of output
-partition <Block ID> : Specify a predefined block to use
-mapfile <mapfile> : Specify an alternative MPI topology. The mapfile path must be fully qualified as seen by the I/O and compute nodes
-verbose { 0 | 1 | 2 | 3 | 4 } : Set the verbosity level. The default is 0, which means that mpirun will not output any status or diagnostic messages unless a severe error occurs. If you are curious about what is happening, try levels 1 or 2. All mpirun-generated status and error messages appear on STDERR
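
A hedged sketch combining some of these options (block name and executable path are hypothetical):

export OMP_NUM_THREADS=4
mpirun -partition MYBLOCK -np 64 -mode SMP -exe /home/bgpuser/a.out \
       -expenv OMP_NUM_THREADS -label -verbose 1
# -expenv forwards OMP_NUM_THREADS from the front-end shell to the compute nodes,
# -label prefixes each output line with the rank that produced it,
# -verbose 1 prints mpirun status messages on STDERR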

MPIRUN Command Reference (Documentation)

MPIRUN Example

mpirun -partition XXX -np 128 -mode SMP -exe /path/exe -cwd working_directory -env "OMP_NUM_THREADS=4 XLSMPOPTS=spins=0:yields=0:stack=64000000"

Execution settings:
- 128 MPI tasks
- SMP mode
- 4 OpenMP threads
- 64 MB thread stack

mpirun application program interfaces available: get_parameters, mpirun_done

MPIRUN Environment Variables
Most command-line options for mpirun can be specified using an environment variable:
-partition              MPIRUN_PARTITION
-nodes                  MPIRUN_NODES
-mode                   MPIRUN_MODE
-exe                    MPIRUN_EXE
-cwd                    MPIRUN_CWD
-host                   MMCS_SERVER_IP
-env                    MPIRUN_ENV
-expenv                 MPIRUN_EXP_ENV
-mapfile                MPIRUN_MAPFILE
-args                   MPIRUN_ARGS
-label                  MPIRUN_LABEL
-enable_tty_reporting   MPIRUN_ENABLE_TTY_REPORTING
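
A minimal sketch using environment variables in place of command-line options (block name, executable path, and working directory are hypothetical):

export MPIRUN_PARTITION=MYBLOCK
export MPIRUN_MODE=SMP
export MPIRUN_EXE=/home/bgpuser/a.out
export MPIRUN_CWD=/home/bgpuser
mpirun -np 128   # remaining options may still be given on the command line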

STDIN / STDOUT / STDERR Support
- STDIN, STDOUT, and STDERR work as expected: you can pipe or redirect files into mpirun and pipe or redirect output from mpirun
- STDIN may also come from the keyboard interactively
- Any compute node may send STDOUT or STDERR data; only MPI rank 0 may read STDIN data
- mpirun always tells the control system and the C runtime on the compute nodes that it is writing to TTY devices, because logically mpirun looks like a pipe: it cannot seek on STDIN, STDOUT, and STDERR even if they are coming from files
- As always, STDIN, STDOUT, and STDERR are the slowest ways to get input and output from a supercomputer; use them sparingly
- STDOUT is not buffered and can generate a huge overhead for some applications; such applications should buffer STDOUT with the -enable_tty_reporting option
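
A minimal redirection sketch (block name and paths are hypothetical):

mpirun -partition MYBLOCK -np 64 -mode SMP -exe /home/bgpuser/a.out \
       -enable_tty_reporting < input.dat > run.out 2> run.err
# input.dat is fed to MPI rank 0; with output redirected to files,
# -enable_tty_reporting allows STDOUT to be buffered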

MPIEXEC Command

What is mpiexec?
- A method for launching and interacting with parallel Multiple Program Multiple Data (MPMD) jobs on Blue Gene/P
- Very similar to mpirun; the only exception is that the arguments supported by mpiexec are slightly different

Command limitations
- A pset is the smallest granularity for each executable, though one executable can span multiple psets
- You must use every compute node of each pset; in particular, different -np values are not supported
- The job's mode (SMP, DUAL, VNM) must be uniform across all psets

MPIEXEC Command Parameters
The only parameter / environment variable supported by mpiexec that is not supported by mpirun:
-configfile / MPIRUN_MPMD_CONFIGFILE
The following parameters / environment variables are not supported by mpiexec, since their use is ambiguous for MPMD jobs:
-args / MPIRUN_ARGS
-cwd / MPIRUN_CWD
-env / MPIRUN_ENV
-env_all / MPIRUN_EXP_ENV_ALL
-exe / MPIRUN_EXE
-exp_env / MPIRUN_EXP_ENV
-partition / MPIRUN_PARTITION
-mapfile / MPIRUN_MAPFILE

MPIEXEC Configuration File

Syntax: -n <Nb Nodes> -wdir <Working Directory> <Binary>

Example configuration file content:
-n 32 -wdir /home/bgpuser /bin/hostname
-n 32 -wdir /home/bgpuser/hello_world /home/bgpuser/hello_world/hello_world

This runs /bin/hostname on one 32-node pset and hello_world on another 32-node pset.
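
A minimal invocation sketch, assuming the configuration file above was saved as /home/bgpuser/mpmd.cfg (hypothetical path) and that mpiexec is installed alongside the other Blue Gene/P tools:

/bgsys/drivers/ppcfloor/bin/mpiexec -configfile /home/bgpuser/mpmd.cfg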

SUBMIT Command

submit = mpirun for HTC
- Command used to run an HTC job; acts as a lightweight shadow for the real job running on a Blue Gene node
- Simplifies user interaction with the system by providing a simple, common interface for launching, monitoring, and controlling HTC jobs
- Run from a Front-end Node; contacts the control system to run the HTC user job
- Allows the user to interact with the running job via the job's standard input, standard output, and standard error

Standard system location: /bgsys/drivers/ppcfloor/bin/submit

HTC Technical Architecture

SUBMIT Command Syntax

/bgsys/drivers/ppcfloor/bin/submit [options]
or
/bgsys/drivers/ppcfloor/bin/submit [options] binary [arg1 arg2 ... argn]

Options:
-exe <exe> : Executable to run
-args "arg1 arg2 ... argn" : Arguments; must be enclosed in double quotes
-env <env=value> : Define an environment variable for the job
-exp_env <env> : Export an environment variable to the job's environment
-env_all : Add all current environment variables to the job's environment
-cwd <cwd> : The job's current working directory
-timeout <seconds> : Number of seconds before the job is killed
-mode <SMP|DUAL|VNM> : Job mode
-location <Rxx-Mx-Nxx-Jxx-Cxx> : Compute core location; regular expressions supported
-pool <id> : Compute Node pool ID
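
A minimal usage sketch (executable path, working directory, and argument are hypothetical):

/bgsys/drivers/ppcfloor/bin/submit -mode SMP -cwd /home/bgpuser \
    -exe /home/bgpuser/serial_app -args "input.dat"
# equivalent positional form:
/bgsys/drivers/ppcfloor/bin/submit -mode SMP -cwd /home/bgpuser /home/bgpuser/serial_app input.dat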

IBM Scheduler for HTC

IBM Scheduler for HTC = HTC job scheduler
- Handles scheduling of HTC jobs

HTC job submission
- External work requests are routed to the HTC scheduler (single or multiple work requests from each source)
- IBM Scheduler for HTC finds an available HTC client and forwards the work request
- The HTC client runs the executable on a compute node
- A launcher program on each compute node handles the work request sent to it by the scheduler
- When the work request completes, the launcher program is reloaded and the client is ready to handle another work request

IBM Scheduler for HTC Components

Purpose
- Provides features not available with the submit interface: queuing of jobs until compute resources are available, and tracking of failed compute nodes
- The submit interface is intended for use by job schedulers, not end users directly

Components
- simple_sched daemon: runs on the Service Node or a Front-end Node; accepts connections from startd and client programs
- startd daemons: run on a Front-end Node; connect to simple_sched, get jobs, and execute submit
- Client programs:
  qsub = submits a job to run
  qdel = deletes a job submitted by qsub
  qstat = gets the status of a submitted job
  qcmd = admin commands

HTC Executables

htcpartition
- Utility program shipped with Blue Gene
- Responsible for booting / freeing HTC partitions from a Front-end Node

run_simple_sched_jobs
- Provides an instance of IBM Scheduler for HTC and startd
- Executes commands either specified in command files or read from stdin
- Creates a cfg file that can be used to submit jobs externally to the command files or stdin
- Exits when all the commands have finished (or can be told to keep running)

IBM Scheduler for HTC Integration with LoadLeveler

LoadLeveler handles
- Partition reservation and booting, via the new LoadLeveler keyword: # @ bg_partition_type = HTC_LINUX_SMP
- Partition shutdown

IBM Scheduler for HTC handles
- Queuing of batches of executions, either specified in command files or read from stdin
- Submission of the executions
- Execution recovery when a failure occurs: only system faults are recovered (a failed submission can be retried); user program failures are considered permanent

IBM Scheduler for HTC Glide-In to LoadLeveler

LoadLeveler Job Command File Example

#!/bin/bash
# @ bg_partition_type = HTC_LINUX_SMP
# @ class = BGP64_1H
# @ comment = "Personality / HTC"
# @ environment =
# @ error = $(job_name).$(jobid).err
# @ group = default
# @ input = /dev/null
# @ job_name = Personality-HTC
# @ job_type = bluegene
# @ notification = never
# @ output = $(job_name).$(jobid).out
# @ queue

# Command File
COMMANDS_RUN_FILE=$PWD/cmds.txt

/bgsys/opt/simple_sched/bin/run_simple_sched_jobs $COMMANDS_RUN_FILE
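
The contents of cmds.txt are not shown in the original material. As a hedged sketch, assuming the command file simply lists one command line per HTC job (executable names and inputs are hypothetical):

/bin/hostname
/home/bgpuser/serial_app input_001.dat
/home/bgpuser/serial_app input_002.dat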

IBM Scheduler for HTC Integration with LoadLeveler < 3.5
- The IBM Scheduler for HTC / LoadLeveler integration described above is valid for LoadLeveler versions >= 3.5
- Integration with LoadLeveler versions < 3.5 is looser: LoadLeveler does not handle partition boot / shutdown
- Consequence: explicit partition boot / shutdown is required in the LoadLeveler job command file, achieved through calls to the HTC binary command htcpartition:
  htcpartition --boot { }
  htcpartition --free

LoadLeveler Job Command File Example (LoadLeveler < 3.5)

#!/bin/bash
# @ class = BGP64_1H
# @ comment = "Personality / HTC"
# @ environment =
# @ error = $(job_name).$(jobid).err
# @ group = default
# @ input = /dev/null
# @ job_name = Personality-HTC
# @ job_type = bluegene
# @ notification = never
# @ output = $(job_name).$(jobid).out
# @ queue

# Command File
COMMANDS_RUN_FILE=$PWD/cmds.txt

# Local Simple Scheduler Configuration File
SIMPLE_SCHED_CONFIG_FILE=$PWD/my_simple_sched.cfg

partition_free() {
  echo "Freeing HTC Partition"
  /bgsys/drivers/ppcfloor/bin/htcpartition --free
}

/bgsys/drivers/ppcfloor/bin/htcpartition --boot --configfile $SIMPLE_SCHED_CONFIG_FILE --mode linux_smp
trap partition_free EXIT
/bgsys/opt/simple_sched/bin/run_simple_sched_jobs -config $SIMPLE_SCHED_CONFIG_FILE $COMMANDS_RUN_FILE