IBM PSSC Montpellier Customer Center
Contents: MPIRUN Command, Environment Variables, LoadLeveler, SUBMIT Command, IBM Simple Scheduler
Control System
Service Node (SN): an IBM System p 64-bit system. The Control System and database run on this system; access is generally privileged. Communication with Blue Gene is over a private 1 Gb control Ethernet.
Database: a commercial database tracks the state of the system: hardware inventory, partition configuration, RAS data, environmental data, operational data (including partition state, jobs, and job history), and service action support for hot-plug hardware.
Administration and system status: administration via either a console or the web Navigator interfaces.
Service Node Database Structure
DB2: Configuration Database, Operational Database, Environmental Database, RAS Database
The Configuration database is the representation of all the hardware on the system.
The Operational database contains information and status for things that do not correspond directly to a single piece of hardware, such as jobs, partitions, and history.
The Environmental database keeps current values for all hardware components on the system, such as fan speeds, temperatures, and voltages.
The RAS database collects hard errors, soft errors, machine checks, and software problems detected in the compute complex.
Useful log files: /bgsys/logs/bgp
Job Launching Mechanism
mpirun Command: standard mpirun options are supported. It may be used to launch any job, not just MPI-based applications, and has options to allocate partitions when a scheduler is not in use.
Scheduler APIs enable various schedulers: LoadLeveler, SLURM, Platform LSF, Altair PBS Pro, Cobalt.
Note: all of these schedulers are layered on mpirun/mpiexec.
MPIRUN Implementation
Identical functionality to the BG/L implementation, plus a new implementation and new options.
No more rsh/ssh mechanism, for security reasons; it is replaced by a daemon running on the Service Node.
The freepartition command is integrated as an option (-free).
Standard input (STDIN) is supported on BG/P (MPI task 0 only).
MPIRUN Command Parameters 1
-args "program args"  Pass "program args" to the Blue Gene job on the compute nodes
-cwd <Working Directory>  Specifies the full path to use as the current working directory on the compute nodes. The path is specified as seen by the I/O and compute nodes
-exe <Executable>  Specifies the full path to the executable to run on the compute nodes. The path is specified as seen by the I/O and compute nodes
-mode { SMP DUAL VN }  Specify the mode the job will run in: SMP, dual, or virtual node mode
-np <Nb MPI Tasks>  Create exactly that many MPI ranks for the job. Aliases are -nodes and -n
MPIRUN Command Parameters 2
-enable_tty_reporting  By default mpirun tells the control system and the C runtime on the compute nodes that STDIN, STDOUT, and STDERR are tied to TTY-type devices. This option reports the actual device types instead, enabling STDOUT buffering (GPFS block size) when the streams are files
-env <Variable Name>=<Variable Value>  Set an environment variable in the environment of the job on the compute nodes
-expenv <Variable Name>  Export an environment variable from mpirun's current environment to the job on the compute nodes
-label  Use this option to have mpirun label the source of each line of output
-partition <Block ID>  Specify a predefined block to use
-mapfile <mapfile>  Specify an alternative MPI topology. The mapfile path must be fully qualified, as seen by the I/O and compute nodes
-verbose { 0 1 2 3 4 }  Set the verbosity level. The default is 0, which means that mpirun will not output any status or diagnostic messages unless a severe error occurs. If you are curious about what is happening, try levels 1 or 2. All mpirun-generated status and error messages appear on STDERR
MPIRUN Command Reference (Documentation)
MPIRUN Example
mpirun -partition XXX -np 128 -mode SMP -exe /path/exe -cwd working_directory -env "OMP_NUM_THREADS=4 XLSMPOPTS=spins=0:yields=0:stack=64000000"
Execution settings: 128 MPI tasks, SMP mode, 4 OpenMP threads, 64 MB thread stack
mpirun application program interfaces available: get_parameters, mpirun_done
MPIRUN Environment Variables
Most command line options for mpirun can be specified using an environment variable:
-partition  MPIRUN_PARTITION
-nodes  MPIRUN_NODES
-mode  MPIRUN_MODE
-exe  MPIRUN_EXE
-cwd  MPIRUN_CWD
-host  MMCS_SERVER_IP
-env  MPIRUN_ENV
-expenv  MPIRUN_EXP_ENV
-mapfile  MPIRUN_MAPFILE
-args  MPIRUN_ARGS
-label  MPIRUN_LABEL
-enable_tty_reporting  MPIRUN_ENABLE_TTY_REPORTING
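As a sketch, the variables in the table above can stand in for command-line flags; the partition name and executable path below are placeholders, not real system values:

```shell
# Configure mpirun via environment variables instead of flags
# (placeholder values throughout)
export MPIRUN_PARTITION=R00-M0-N00    # equivalent to -partition
export MPIRUN_NODES=128               # equivalent to -nodes / -np
export MPIRUN_MODE=SMP                # equivalent to -mode
export MPIRUN_EXE=/home/bgpuser/a.out # equivalent to -exe
# mpirun would now need no flags at all:
# mpirun
echo "$MPIRUN_PARTITION $MPIRUN_MODE $MPIRUN_NODES"
```

This is convenient in batch scripts where the same partition and mode are reused across several invocations.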
STDIN / STDOUT / STDERR Support
STDIN, STDOUT, and STDERR work as expected: you can pipe or redirect files into mpirun and pipe or redirect output from mpirun. STDIN may also come from the keyboard interactively.
Any compute node may send STDOUT or STDERR data; only MPI rank 0 may read STDIN data.
mpirun always tells the control system and the C runtime on the compute nodes that it is writing to TTY devices. This is because mpirun logically looks like a pipe: it cannot seek on STDIN, STDOUT, and STDERR even when they come from files.
As always, STDIN, STDOUT, and STDERR are the slowest ways to get input and output from a supercomputer; use them sparingly. STDOUT is not buffered and can generate huge overhead for some applications. Such applications should enable STDOUT buffering with the -enable_tty_reporting option.
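The redirection patterns above look like ordinary shell usage; a sketch, with hypothetical partition name and paths, runnable only from a Blue Gene/P front-end:

```shell
# Redirect a file into the job's STDIN (read by MPI rank 0 only)
# and capture STDOUT / STDERR separately:
mpirun -partition R00-M0 -exe /home/bgpuser/a.out < input.dat > run.out 2> run.err

# Piping works too, since mpirun behaves like a pipe:
cat input.dat | mpirun -partition R00-M0 -exe /home/bgpuser/a.out | grep RESULT
```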
MPIEXEC Command
What is mpiexec? A method for launching and interacting with parallel Multiple Program Multiple Data (MPMD) jobs on Blue Gene/P. It is very similar to mpirun; the only exception is that the arguments supported by mpiexec are slightly different.
Command limitations: a pset is the smallest granularity for each executable, though one executable can span multiple psets. You must use every compute node of each pset; specifically, different -np values are not supported. The job's mode (SMP, DUAL, VNM) must be uniform across all psets.
MPIEXEC Command Parameters
The only parameter / environment variable supported by mpiexec that is not supported by mpirun:
-configfile / MPIRUN_MPMD_CONFIGFILE
The following parameters / environment variables are not supported by mpiexec, since their use is ambiguous for MPMD jobs:
-args / MPIRUN_ARGS
-cwd / MPIRUN_CWD
-env / MPIRUN_ENV
-env_all / MPIRUN_EXP_ENV_ALL
-exe / MPIRUN_EXE
-exp_env / MPIRUN_EXP_ENV
-partition / MPIRUN_PARTITION
-mapfile / MPIRUN_MAPFILE
MPIEXEC Configuration File
Syntax: -n <Nb Nodes> -wdir <Working Directory> <Binary>
Example configuration file content:
-n 32 -wdir /home/bgpuser /bin/hostname
-n 32 -wdir /home/bgpuser/hello_world /home/bgpuser/hello_world/hello_world
This runs /bin/hostname on one 32-node pset and hello_world on another 32-node pset.
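Assuming the configuration file above were saved as /home/bgpuser/mpmd.cfg (a hypothetical path), a launch could look like this, with the partition name also a placeholder:

```shell
# Launch the MPMD job described by the configuration file
# on a predefined 64-node block:
mpiexec -partition R00-M0 -configfile /home/bgpuser/mpmd.cfg
```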
SUBMIT Command
submit = the mpirun command for HTC: the command used to run an HTC job, acting as a lightweight shadow for the real job running on a Blue Gene node.
It simplifies user interaction with the system by providing a simple common interface for launching, monitoring, and controlling HTC jobs.
Run from a Frontend Node, it contacts the control system to run the HTC user job and allows the user to interact with the running job via the job's standard input, standard output, and standard error.
Standard system location: /bgsys/drivers/ppcfloor/bin/submit
HTC Technical Architecture
SUBMIT Command Syntax
/bgsys/drivers/ppcfloor/bin/submit [options] or /bgsys/drivers/ppcfloor/bin/submit [options] binary [arg1 arg2... argn]
Options
-exe <exe>  Executable to run
-args "arg1 arg2... argn"  Arguments; must be enclosed in double quotes
-env <env=value>  Define an environment variable for the job
-exp_env <env>  Export an environment variable to the job's environment
-env_all  Add all current environment variables to the job's environment
-cwd <cwd>  The job's current working directory
-timeout <seconds>  Number of seconds before the job is killed
-mode <SMP DUAL VNM>  Job mode
-location <Rxx-Mx-Nxx-Jxx-Cxx>  Compute core location; regular expressions supported
-pool <id>  Compute Node pool ID
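A minimal sketch of an HTC submission; the pool ID, executable path, and argument are placeholders:

```shell
# Run one HTC job in SMP mode, killed after 10 minutes if still running;
# the user's OMP_NUM_THREADS setting is exported to the job's environment:
/bgsys/drivers/ppcfloor/bin/submit -mode SMP -pool HTCPOOL \
    -timeout 600 -exp_env OMP_NUM_THREADS \
    -exe /home/bgpuser/a.out -args "input.dat"
```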
IBM Scheduler for HTC
IBM Scheduler for HTC = the HTC jobs scheduler; it handles scheduling of HTC jobs.
HTC job submission: external work requests are routed to the HTC scheduler, as single or multiple work requests from each source. IBM Scheduler for HTC finds an available HTC client and forwards the work request; the HTC client runs the executable on a compute node.
A launcher program on each compute node handles the work requests sent to it by the scheduler. When a work request completes, the launcher program is reloaded and the client is ready to handle another work request.
IBM Scheduler for HTC Components
Purpose: provides features not available with the submit interface: queuing of jobs until compute resources are available, and tracking of failed compute nodes. The submit interface is intended for use by job schedulers, not end users directly.
Components:
simple_sched daemon: runs on the Service Node or a Frontend Node; accepts connections from startd and client programs
startd daemons: run on a Frontend Node; connect to simple_sched, get jobs, and execute submit
Client programs: qsub = submits a job to run; qdel = deletes a job submitted by qsub; qstat = gets the status of a submitted job; qcmd = admin commands
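A sketch of the client workflow; the job script path is hypothetical, and the assumption that qsub prints a job identifier to standard output is mine, not stated in the source:

```shell
jobid=$(qsub /home/bgpuser/job.sh)  # submit the job; capture its identifier (assumed output format)
qstat "$jobid"                      # query the status of the submitted job
qdel "$jobid"                       # delete the job if it is no longer needed
```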
HTC Executables
htcpartition: utility program shipped with Blue Gene, responsible for booting / freeing HTC partitions from a Frontend Node.
run_simple_sched_jobs: provides an instance of IBM Scheduler for HTC and startd. Executes commands either specified in command files or read from stdin. Creates a cfg file that can be used to submit jobs externally to the command files or stdin. Exits when the commands have all finished (or can be told to keep running).
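A command file such as the cmds.txt used in the LoadLeveler examples below is assumed here to hold one executable invocation per line; a hypothetical example:

```shell
# cmds.txt: each line is one HTC work request (paths are placeholders)
/bin/hostname
/home/bgpuser/bin/post_process /home/bgpuser/data/run1
/home/bgpuser/bin/post_process /home/bgpuser/data/run2
```

run_simple_sched_jobs would then execute these requests and exit once they have all finished.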
IBM Scheduler for HTC Integration with LoadLeveler
LoadLeveler handles partition reservation and booting (new LoadLeveler keyword: # @ bg_partition_type = HTC_LINUX_SMP) as well as partition shutdown.
IBM Scheduler for HTC handles queueing of batches of executions, either specified in command files or read from stdin, submission of executions, and execution recovery when a failure occurs. Only system faults are recovered: a failed submission can be retried, while user program failures are considered permanent.
IBM Scheduler for HTC Glide-In to LoadLeveler
LoadLeveler Job Command File Example
#!/bin/bash
# @ bg_partition_type = HTC_LINUX_SMP
# @ class = BGP64_1H
# @ comment = "Personality / HTC"
# @ environment =
# @ error = $(job_name).$(jobid).err
# @ group = default
# @ input = /dev/null
# @ job_name = Personality-HTC
# @ job_type = bluegene
# @ notification = never
# @ output = $(job_name).$(jobid).out
# @ queue
# Command File
COMMANDS_RUN_FILE=$PWD/cmds.txt
/bgsys/opt/simple_sched/bin/run_simple_sched_jobs $COMMANDS_RUN_FILE
IBM Scheduler for HTC Integration with LoadLeveler < 3.5
The IBM Scheduler for HTC / LoadLeveler integration described above is valid for LoadLeveler versions >= 3.5. Integration with LoadLeveler versions < 3.5 is looser: LoadLeveler doesn't handle partition boot / shutdown.
Consequence: explicit partition boot / shutdown is required in the LoadLeveler job command file, achieved through calls to the HTC binary command htcpartition:
htcpartition --boot { }
htcpartition --free
LoadLeveler Job Command File Example (LL < v3.5)
#!/bin/bash
# @ class = BGP64_1H
# @ comment = "Personality / HTC"
# @ environment =
# @ error = $(job_name).$(jobid).err
# @ group = default
# @ input = /dev/null
# @ job_name = Personality-HTC
# @ job_type = bluegene
# @ notification = never
# @ output = $(job_name).$(jobid).out
# @ queue
# Command File
COMMANDS_RUN_FILE=$PWD/cmds.txt
# Local Simple Scheduler Configuration File
SIMPLE_SCHED_CONFIG_FILE=$PWD/my_simple_sched.cfg
partition_free() {
  echo "Freeing HTC Partition"
  /bgsys/drivers/ppcfloor/bin/htcpartition --free
}
/bgsys/drivers/ppcfloor/bin/htcpartition --boot --configfile $SIMPLE_SCHED_CONFIG_FILE --mode linux_smp
trap partition_free EXIT
/bgsys/opt/simple_sched/bin/run_simple_sched_jobs -config $SIMPLE_SCHED_CONFIG_FILE $COMMANDS_RUN_FILE