Blue Gene/Q User Workshop. User Environment & Job submission
Joshua Walker
Slide 2: Topics
- Blue Joule User Environment
- LoadLeveler
- Task Placement & BG/Q Personality
Slide 3: Blue Joule User Accounts
- Home directories are organised on a project basis: /home/[project name]/[pi]/[username]-[pi]
- Each project has a shared directory, $HOME/../shared, readable and writable by all project members
- All members of a project share the same group and can read each other's home directories
- You cannot read your own home directories in other projects
- Per-user storage: unlimited (4.6 PB in total)
- Archiving and back-up: none
- The $HOME directory is used for all runs; there is no $WORK or $TEMP directory
Slide 4: Blue Joule System Access
- Front-End Node (FEN): joule.hartree.stfc.ac.uk
- Access is via ssh key exchange only
- To access from other hosts:
  - Create a new private key and add the public key to ~/.ssh/authorized_keys
  - Copy the private key to ~/.ssh/ on the other host (it must not be world-readable)
- Remote copying: use scp or rsync as usual
  - However, to copy directly between your home directories in different projects, private/public key pairs must be present in both the source and destination accounts
- Accessing outside services (wget, curl, etc.):
  - export http_proxy=
  - Set ftp_proxy and https_proxy to the same value
- Subversion: edit ~/.subversion/servers, adding
  - http-proxy-host =
  - http-proxy-port =
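Put together, the key and proxy steps above might look like the following sketch. The proxy address is a placeholder, not the real site value (the slide intentionally leaves it blank), and the key-generation and copy steps are shown as comments since they touch a second host:

```shell
# Prepare the key store on the FEN
mkdir -p ~/.ssh && chmod 700 ~/.ssh

# A new key pair would be created with, e.g.:
#   ssh-keygen -t rsa -f ~/.ssh/id_rsa_joule
# and its public half appended to the FEN's authorized_keys:
touch ~/.ssh/authorized_keys
#   cat ~/.ssh/id_rsa_joule.pub >> ~/.ssh/authorized_keys

# The private half, copied to the other host, must not be world-readable:
#   chmod 600 ~/.ssh/id_rsa_joule

# Proxy settings for outbound tools (wget, curl, svn over http);
# proxy.example.com:8080 is a placeholder for the real site proxy
export http_proxy=http://proxy.example.com:8080
export https_proxy=$http_proxy
export ftp_proxy=$http_proxy
```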
Slide 5: Blue Joule Environment Modules
- Modules is an open-source utility used to manage centrally available software on Hartree systems
- module avail: shows available software
- module load [name]: loads the software into your environment
  - Sets various environment variables to make the software available: PATH, LD_RUN_PATH, LD_LIBRARY_PATH, etc.
- module show [name]: shows the variables set by loading the module
- Other useful commands: module unload [name], module list
- The main module you will need to load is ibmmpi
- You have to explicitly define the module function in job scripts if you want to load modules:
  - source /etc/profile.d/modules.sh (Bourne-based shells)
  - source /etc/profile.d/modules.csh (csh-derived shells)
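Put together, the opening of a Bourne-shell job script that uses modules might look like this sketch (ibmmpi is the module named above; the rest of the script is illustrative):

```shell
#!/bin/bash
# Define the module function first -- it is not available in job scripts by default
source /etc/profile.d/modules.sh

# Load the MPI environment (sets PATH, LD_LIBRARY_PATH, etc.)
module load ibmmpi
module list          # record the loaded modules in the job output
```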
Slide 6: Blue Joule Configuration
- 6 production racks
  - Torus: 12*4*8*8*2; midplanes: 3*1*2*2
- 1 development rack (BGAS)
  - Torus: 4*4*4*8*2; midplanes: 1*1*1*2
- 1 I/O node per 4 node boards, so the minimum block size is 128 nodes
- Job scheduler: LoadLeveler
  - Jobs are queued based on required resources (nodes and walltime)
- For the workshop hands-on sessions we have exclusive use of the development rack: 8 blocks of 128 nodes
Slide 7: BG/Q Driver
- The collection of BG/Q-specific software is referred to as a driver
  - Kernel interfaces, the Compute Node Kernel (CNK), the gcc toolchain/binutils, BG/Q commands, MPI wrapper scripts, communication libraries
- There are multiple driver versions (current: V1M1R2), located under /bgsys/drivers/
  - This makes it possible to switch drivers, e.g. on discovering a problem
  - /bgsys/drivers/ppcfloor is a link to the current driver
- The ibmmpi module sets a number of the paths necessary to use the current driver software
  - MPI wrappers, communication libraries (XL version), kernel headers and libraries
- Driver directories of interest to developers:
  - spi: headers/libraries for accessing BG/Q hardware features
  - comm: various communication libraries
  - gnu-linux: the gnu-linux toolchain (gcc, cpp, binutils, etc.)
  - bgpm: hardware performance monitor
- Note: the driver does not contain the XL compilers, but it does contain the gcc compilers (and the rest of the toolchain). The XL compilers are located under /opt/ibmcmp/
Slide 8: The runjob Command
Slide 9: runjob: Overview
- Command syntax: runjob <options> : <executable> <arguments>
- Example: runjob --cwd /scratch/nt05984s/hello_world --env-all --label --np 2 --ranks-per-node 32 : ./hw
- Important: memory is allocated per process based on ranks-per-node: (16 GB of shared node memory) / ranks-per-node

  Argument             Value               Purpose
  --cwd                working directory   Change to the execution directory
  --env-all / --envs   - / <var>=<value>   Export the environment (all, or a specific variable)
  --label              -                   Prefix stdout records with the MPI rank
  --np                 # MPI tasks         Total number of MPI tasks
Slide 10: runjob: np, ranks-per-node & Threads
- The total number of MPI processes to run is controlled by the --np parameter, e.g. --np 1024
- The maximum number of MPI processes placed on each node is controlled by the --ranks-per-node parameter, e.g. --ranks-per-node 16
  - Processes are assigned to a node up to this value before moving to the next node
- The total number of nodes required is (--np)/(--ranks-per-node), e.g. 1024/16 = 64
- The number of OpenMP threads per process is controlled by OMP_NUM_THREADS
  - There is a maximum of 64 OpenMP threads per node, therefore (--ranks-per-node)*(OMP_NUM_THREADS) <= 64
- Memory is assigned to a process based on --ranks-per-node
  - Each process gets 16/(--ranks-per-node) GB
  - Note: if np < ranks-per-node, each process still gets only this amount of memory, e.g. np=1 and ranks-per-node=16 gives one process on one node with 1 GB of memory
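The arithmetic above can be checked with a quick shell calculation; a sketch, assuming the 16 GB of shared node memory stated on the previous slide:

```shell
np=1024    # total MPI tasks (--np)
rpn=16     # tasks per node (--ranks-per-node)

nodes=$(( np / rpn ))            # nodes required
mem_mb=$(( 16 * 1024 / rpn ))    # memory per process in MB (16 GB node / rpn)
max_threads=$(( 64 / rpn ))      # upper bound for OMP_NUM_THREADS

echo "$nodes nodes, ${mem_mb} MB per process, up to $max_threads threads each"
# -> 64 nodes, 1024 MB per process, up to 4 threads each
```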
Slide 11: runjob: Process Mapping
- The default mapping places MPI ranks on the system in ABCDET order, where the rightmost letter increments first; <A,B,C,D,E> are torus coordinates and T is the processor ID in each node (T = 0 to N-1, where N is the number of processes per node being used)
- To change the default mapping: runjob --mapping TEDCBA, or runjob --mapping my.map
- Note: mapping will be covered in detail later
Slide 12: MPMD Execution
- Multiple Program Multiple Data (MPMD) jobs are jobs for which a different executable and arguments can be supplied within a single job
- All tasks of the job share MPI_COMM_WORLD and can share data between the different executables via the torus
- To enable MPMD support, specify a mapping file with the runjob --mapping option; within the mapping file, keywords control MPMD behaviour on the nodes:
  - #mpmdbegin {ranks}
  - #mpmdcmd <executable> <arg0> <arg1> ... <argn>
  - #mpmdend
- {ranks} specifies the MPI rank numbers:
  - Multiple MPI ranks can be specified with commas: #mpmdbegin 3,6,9
  - Ranges of MPI ranks can be specified with a dash: #mpmdbegin 0-15
  - Ranges can take a stride with 'x': #mpmdbegin 0-15x2 (ranks 0, 2, 4, 6, 8, 10, 12 and 14)
  - Sets and ranges can be mixed: #mpmdbegin 0,2,5-15
- Care must be taken to avoid assigning a rank to multiple programs
- Restriction: all ranks on the same node must run the same program
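As an illustration, the MPMD directives for a 32-rank job that runs a hypothetical ./master binary on one node and ./worker on another might read as follows (at --ranks-per-node 16, ranks 0-15 share node 0 and ranks 16-31 share node 1, so each range respects the one-program-per-node restriction; the executable names are invented):

```
#mpmdbegin 0-15
#mpmdcmd ./master --verbose
#mpmdend
#mpmdbegin 16-31
#mpmdcmd ./worker
#mpmdend
```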
Slide 13: Tools
- If the user wants to launch a tool during their job, they need to tell the Control System
- The start_tool executable tells the Control System to start tool daemons on all of the I/O nodes servicing the job to be controlled or monitored
- The tools communicate with the Common I/O Services (CIOS) daemons running on the I/O nodes to pass messages back and forth to the compute nodes, where the user code is running
- No user code runs on the I/O node; it acts as a proxy/manager for the compute nodes
- Examples:
  $~> start_tool --id 123 --tool /path/to/my_great_tool --args one hello two world
  $~> end_tool --id 123 --tool 1
Slide 14: LoadLeveler
Slide 15: Changes in Blue Gene Terminology
Terminology changes have been made in Blue Gene/Q:

  Terminology in BG/[L,P]   Terminology in BG/Q
  Base Partitions           Midplanes
  Partitions                Blocks
  Wires                     Cables
  NodeCards                 NodeBoards

LoadLeveler externals reflect these changes to be consistent with Blue Gene/Q.
Slide 16: LoadLeveler Job Command File
The following table summarises the main keywords required:

  Keyword           Notes
  class             queue to submit to (prod)
  bg_size           number of nodes
  executable        runjob / command-file name (script)
  job_type          bluegene
  wall_clock_limit  determines the queue
Slide 17: LoadLeveler Command File Variables
- $(home): the home directory for the user on the cluster selected to run the job
- $(jobid): the sequential number assigned to this job by the Schedd daemon; $(jobid) and $(cluster) are equivalent
- $(stepid): the sequential number assigned to this job step when multiple queue statements are used in the job command file; $(stepid) and $(process) are equivalent
- $(user): the user name on the cluster selected to run the job
- The following keywords are also available as variables if defined in the job command file: $(executable), $(class), $(comment), $(job_name), $(step_name)
Slide 18: LoadLeveler Job Command File
Sample job command file:

  # @ job_type = bluegene
  # @ class = prod
  # @ error = size512.$(host).$(cluster).$(process).err
  # @ output = size512.$(host).$(cluster).$(process).out
  # @ executable = /bgsys/drivers/ppcfloor/hlcs/bin/runjob
  # @ arguments = --exe /bin/date
  # @ bg_size = 512
  # @ bg_connectivity = Torus
  # @ queue
Slide 19: LoadLeveler Job Command File
Sample job command file (script):

  #!/bin/bash
  # @ job_type = bluegene
  # @ class = prod
  # @ error = size512.$(host).$(cluster).$(process).err
  # @ output = size512.$(host).$(cluster).$(process).out
  # @ executable = loadleveler-script.sh
  # @ bg_size = 512
  # @ bg_connectivity = Torus
  # @ queue
  export BG_THREADLAYOUT=2
  llq -l $LOADL_STEP_ID
  runjob --env-all --ranks-per-node 16 : ./myexec
Slide 20: LoadLeveler Multistep Jobs
- A job command file can specify multiple jobs; each job is termed a job step
- Useful when a computation is composed of multiple parts
- Each job step is delimited by the queue keyword
- Values of keywords are inherited from previous job steps
- Each job step is treated independently by default
  - Use the dependency keyword to specify a dependency: the exit status of a previous job step determines whether the step should be executed
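A two-step job command file using keyword inheritance and a dependency might look like the following sketch (the step names, executables, and runjob arguments are illustrative):

```
# @ job_type = bluegene
# @ class = prod
# @ bg_size = 128
# @ wall_clock_limit = 00:30:00
# @ executable = /bgsys/drivers/ppcfloor/hlcs/bin/runjob
# @ step_name = prepare
# @ arguments = --exe ./prepare
# @ queue
#
# The second step inherits class, bg_size, etc., and runs only if
# the "prepare" step exited with status 0.
# @ step_name = compute
# @ dependency = (prepare == 0)
# @ arguments = --exe ./compute
# @ queue
```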
Slide 21: Shapes and Connectivity
- BG/Q supports 5-dimensional shapes, AxBxCxDxE
- The 5th dimension (E) is located on the node board and is always of size 2
- Therefore, for scheduling large-block jobs (>= 1 midplane), only a 4-dimensional shape is allowed
- In BG/P, connectivity was specified for the block as a whole: Torus or Mesh
- In BG/Q, connectivity can now be specified for large blocks per dimension (A, B, C and D), with values Torus or Mesh
Slide 22: LoadLeveler Job Command File
The following table summarises changes in the job command file (JCF) keywords applicable to Blue Gene jobs:

  Keyword                Type      Notes
  bg_block               new       replaces bg_partition
  bg_shape               existing  accepts 4D shapes only
  bg_connectivity        new       replaces bg_connection
  bg_requirements        existing  no change
  bg_rotate              existing  no change
  bg_node_configuration  new       -
Slide 23: LoadLeveler Job Command File
- bg_shape = AxBxCxD
  - Specifies a 4-dimensional shape (large block) to create for the BG job
  - Example: bg_shape = 1x2x1x1 requests 2 midplanes in the B dimension
- bg_connectivity = Torus | Mesh | Either | Xa Xb Xc Xd, where each of Xa, Xb, Xc, Xd is Torus or Mesh
  - Specifies the connectivity of large blocks, either for the entire block or per dimension
  - Example: bg_connectivity = Torus Mesh Torus Torus requests Mesh connectivity in the B dimension and Torus connectivity in the A, C and D dimensions
Slide 24: LoadLeveler Job Command File
- bg_rotate = true | false
  - Specifies whether the scheduler should rotate the shape when trying to find a block to run the job; connectivity is preserved per dimension when rotating the shape
  - Example: bg_rotate = false, do not rotate the shape when scheduling the job
- bg_node_configuration = node_configuration
  - Allows users to specify a custom node configuration to use when booting the block
  - Node configurations are created by the administrator from bg_console; the default node configuration is CNKDefault
Slide 25: LoadLeveler Commands
Normal LoadLeveler operating commands are identical on BG/Q and BG/P. Some examples:

  Action                             Command
  Submitting jobs                    llsubmit
  Monitoring jobs                    llq
  Modifying attributes of idle jobs  llmodify
  Making reservations                llmkres
  Changing reservations              llchres
Slide 27: Query Blue Gene System
- A new command, llbgstatus, is provided to query Blue Gene system information
- Usage: llbgstatus [-? | -H | -v | [-X cluster_list | -X all] [-l] [-M all | -M midplane_list | -B all | -B block_list]]
- Blue Gene query options have been removed from the llstatus command
Slide 28: Query Blue Gene System
The table below shows the previous llstatus commands used on BG/[L,P] and the equivalent llbgstatus commands on BG/Q:

  Description                 BG/[L,P] Command     BG/Q Command
  Query BG machine            llstatus -b          llbgstatus
  Query BG machine (details)  llstatus -b -l       llbgstatus -l
  Query midplanes             llstatus -B R00-M0   llbgstatus -M R00-M0
  Query blocks                llstatus -P LL       llbgstatus -B LL
Slide 29: Task Placement / Processor Affinity
Slide 30: Compute Node Resources
- The compute node contains 17 physical cores
  - 16 physical cores are dedicated to the user application; 1 core is dedicated to the system
- Each core is 4-way SMT (can run 4 threads): 64 hardware threads in total
- Each hardware thread can support a fixed maximum number of software threads (pthreads); the current number is five
- Once a pthread is bound to a hardware thread, it does not move from that hardware thread unless acted upon by a set-affinity action such as pthread_setaffinity
- The main thread of a process cannot be moved to another hardware thread; it remains on the hardware thread it was started on
- Naming conventions:
  - Processor core IDs identify physical cores: values 0 to 16 (17 per node)
  - Processor thread IDs identify (SMT) threads on the same physical core: values 0 to 3 (4 per physical core)
  - Hardware threads can also be identified by the processor ID: values 0 to 67 (68 per node)
Slide 31: Processor Core IDs / Thread IDs / Processor IDs

  Processor Core ID  Thread ID  Processor ID
  0                  0,1,2,3    0,1,2,3
  1                  0,1,2,3    4,5,6,7
  2                  0,1,2,3    8,9,10,11
  ...                0,1,2,3    ...
  14                 0,1,2,3    56,57,58,59
  15                 0,1,2,3    60,61,62,63
  16                 0,1,2,3    64,65,66,67
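The table follows a simple rule: the flat processor ID is 4 * (core ID) + (thread ID). A quick sanity check in shell:

```shell
# Processor ID = 4 * (processor core ID) + (thread ID); e.g. core 14, thread 3
core=14
thread=3
pid=$(( 4 * core + thread ))
echo "core $core, thread $thread -> processor ID $pid"   # -> 59
```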
Slide 32: Execution Modes
- Jobs can be run with varying numbers of processes per node, from 1 process per node to 64
- 1 process per node: minimum node utilisation, but multi-threading can occur
- 64 processes per node: every hardware thread is occupied by one distinct process
- Hardware threads are dedicated to a single user process/thread
- All processes are given an equal number of hardware threads across the node; these hardware threads can be used for multi-threading
# 
Slide 33: Processes, Cores and Hardware Threads per Node
(The table body did not survive transcription; the values below follow from the equal division of the node's 16 user cores and 64 hardware threads described on the previous slide.)

  # processes per node  # physical cores per process  # hardware threads per process
  1                     16                            64
  2                     8                             32
  4                     4                             16
  8                     2                             8
  16                    1                             4
  32                    1/2                           2
  64                    1/4                           1
Slide 34: Thread Affinity/Layout
- Breadth-first assignment (round-robin allocation)
  - The default thread layout algorithm; corresponds to BG_THREADLAYOUT = 1
  - The hardware thread-selection algorithm progresses across the cores assigned to the process before selecting additional threads within a given core
- Depth-first assignment (fill-up allocation)
  - Corresponds to BG_THREADLAYOUT = 2
  - The hardware thread-selection algorithm progresses within each core before moving to another core assigned to the process
- Processor affinity is enforced by the Blue Gene control system
  - Processes are assigned to one or more hardware threads at job initialisation time
  - The assignment does not change for the life of the job
Slide 35: Thread Layout, Breadth-First Assignment [BG_THREADLAYOUT=1]

  Processes per node  Processor ID assignment order
  1                   0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,
                      2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,
                      1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,
                      3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63
  2                   0,4,8,12,16,20,24,28,2,6,10,14,18,22,26,30,
                      1,5,9,13,17,21,25,29,3,7,11,15,19,23,27,31
                      32,36,40,44,48,52,56,60,34,38,42,46,50,54,58,62,
                      33,37,41,45,49,53,57,61,35,39,43,47,51,55,59,63
  4                   0,4,8,12,2,6,10,14,1,5,9,13,3,7,11,15
                      16,20,24,28,18,22,26,30,17,21,25,29,19,23,27,31
                      32,36,40,44,34,38,42,46,33,37,41,45,35,39,43,47
                      48,52,56,60,50,54,58,62,49,53,57,61,51,55,59,63
Slide 36: Thread Layout, Depth-First Assignment [BG_THREADLAYOUT=2]

  Processes per node  Processor ID assignment order
  1                   0,2,1,3,...,60,62,61,63
  2                   0,2,1,3,...,28,30,29,31
                      32,34,33,35,...,60,62,61,63
  4                   0,2,1,3,...,12,14,13,15
                      16,18,17,19,...,28,30,29,31
                      32,34,33,35,...,44,46,45,47
                      48,50,49,51,...,60,62,61,63
Slide 37: BG/Q Personality
Slide 38: Personality Definition
- Two definitions:
  - Static data given to every compute node and I/O node at boot time by the Control System; it contains information specific to the node with respect to the block being booted
  - A set of C-language structures containing such items as the node's coordinates on the torus network
- Useful to determine, at run time, where the tasks of the application are running
- Allows fine-tuning of application performance, for instance by knowing which set of tasks shares the same I/O node
Slide 39: Personality Usage
- Include file: #include <spi/include/kernel/location.h>
- Structure: Personality_t pers
- Query function: Kernel_GetPersonality(&pers, sizeof(pers));
- Properties:
  - pers.network_config.[a-e]nodes: number of nodes in each torus dimension
  - pers.network_config.[a-e]coord: coordinates of the node in the torus
  - pers.network_config.[a-e]bridge: coordinates of the I/O bridges in the torus
- Other routines:
  - Kernel_ProcessorID(): processor ID (0-63)
  - Kernel_ProcessorCoreID(): processor core ID (0-15)
  - Kernel_ProcessorThreadID(): processor thread ID (0-3)
Slide 40: Additional Slides
Slide 41: Job Submission Architecture (diagram)
Slide 42: LoadLeveler Job Command File
- job_type = bluegene
  - Must be specified in the JCF to identify a Blue Gene job
- bg_size = <number>
  - Specifies the number of compute nodes requested by the BG job
  - Example: bg_size = 256 requests 256 compute nodes
- bg_block = <block name>
  - Specifies the name of a block, created outside of LoadLeveler, on which to run the job
  - Example: bg_block = s32 requests that the job runs on compute block s32
More informationIntroduction to SLURM on the High Performance Cluster at the Center for Computational Research
Introduction to SLURM on the High Performance Cluster at the Center for Computational Research Cynthia Cornelius Center for Computational Research University at Buffalo, SUNY 701 Ellicott St Buffalo, NY
More informationRunning in parallel. Total number of cores available after hyper threading (virtual cores)
First at all, to know how many processors/cores you have available in your computer, type in the terminal: $> lscpu The output for this particular workstation is the following: Architecture: x86_64 CPU
More informationEffective Use of CCV Resources
Effective Use of CCV Resources Mark Howison User Services & Support This talk... Assumes you have some familiarity with a Unix shell Provides examples and best practices for typical usage of CCV systems
More informationHigh Performance Computing (HPC) Using zcluster at GACRC
High Performance Computing (HPC) Using zcluster at GACRC On-class STAT8060 Georgia Advanced Computing Resource Center University of Georgia Zhuofei Hou, HPC Trainer zhuofei@uga.edu Outline What is GACRC?
More informationTECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 6 th CALL (Tier-0)
TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 6 th CALL (Tier-0) Contributing sites and the corresponding computer systems for this call are: GCS@Jülich, Germany IBM Blue Gene/Q GENCI@CEA, France Bull Bullx
More informationIntroduction to HPC Using zcluster at GACRC
Introduction to HPC Using zcluster at GACRC On-class PBIO/BINF8350 Georgia Advanced Computing Resource Center University of Georgia Zhuofei Hou, HPC Trainer zhuofei@uga.edu Outline What is GACRC? What
More informationCluster Clonetroop: HowTo 2014
2014/02/25 16:53 1/13 Cluster Clonetroop: HowTo 2014 Cluster Clonetroop: HowTo 2014 This section contains information about how to access, compile and execute jobs on Clonetroop, Laboratori de Càlcul Numeric's
More informationUsing SDSC Systems (part 2)
Using SDSC Systems (part 2) Running vsmp jobs, Data Transfer, I/O SDSC Summer Institute August 6-10 2012 Mahidhar Tatineni San Diego Supercomputer Center " 1 vsmp Runtime Guidelines: Overview" Identify
More informationHigh Performance Beowulf Cluster Environment User Manual
High Performance Beowulf Cluster Environment User Manual Version 3.1c 2 This guide is intended for cluster users who want a quick introduction to the Compusys Beowulf Cluster Environment. It explains how
More informationLab: Hybrid Programming and NUMA Control
Lab: Hybrid Programming and NUMA Control Steve Lantz Workshop: Parallel Computing on Ranger and Longhorn May 17, 2012 Based on materials developed by by Kent Milfeld at TACC 1 What You Will Learn How to
More informationThe IBM Blue Gene/Q: Application performance, scalability and optimisation
The IBM Blue Gene/Q: Application performance, scalability and optimisation Mike Ashworth, Andrew Porter Scientific Computing Department & STFC Hartree Centre Manish Modani IBM STFC Daresbury Laboratory,
More informationIntroduction to HPC Using zcluster at GACRC
Introduction to HPC Using zcluster at GACRC Georgia Advanced Computing Resource Center University of Georgia Zhuofei Hou, HPC Trainer zhuofei@uga.edu Outline What is GACRC? What is HPC Concept? What is
More informationQuick Start Guide. by Burak Himmetoglu. Supercomputing Consultant. Enterprise Technology Services & Center for Scientific Computing
Quick Start Guide by Burak Himmetoglu Supercomputing Consultant Enterprise Technology Services & Center for Scientific Computing E-mail: bhimmetoglu@ucsb.edu Contents User access, logging in Linux/Unix
More informationKohinoor queuing document
List of SGE Commands: qsub : Submit a job to SGE Kohinoor queuing document qstat : Determine the status of a job qdel : Delete a job qhost : Display Node information Some useful commands $qstat f -- Specifies
More informationIntroduction to HPC Using zcluster at GACRC
Introduction to HPC Using zcluster at GACRC On-class STAT8330 Georgia Advanced Computing Resource Center University of Georgia Suchitra Pakala pakala@uga.edu Slides courtesy: Zhoufei Hou 1 Outline What
More informationMPICH User s Guide Version Mathematics and Computer Science Division Argonne National Laboratory
MPICH User s Guide Version 3.1.4 Mathematics and Computer Science Division Argonne National Laboratory Pavan Balaji Wesley Bland William Gropp Rob Latham Huiwei Lu Antonio J. Peña Ken Raffenetti Sangmin
More informationA Hands-On Tutorial: RNA Sequencing Using High-Performance Computing
A Hands-On Tutorial: RNA Sequencing Using Computing February 11th and 12th, 2016 1st session (Thursday) Preliminaries: Linux, HPC, command line interface Using HPC: modules, queuing system Presented by:
More informationCOMP4510 Introduction to Parallel Computation. Shared Memory and OpenMP. Outline (cont d) Shared Memory and OpenMP
COMP4510 Introduction to Parallel Computation Shared Memory and OpenMP Thanks to Jon Aronsson (UofM HPC consultant) for some of the material in these notes. Outline (cont d) Shared Memory and OpenMP Including
More informationHybrid MPI/OpenMP parallelization. Recall: MPI uses processes for parallelism. Each process has its own, separate address space.
Hybrid MPI/OpenMP parallelization Recall: MPI uses processes for parallelism. Each process has its own, separate address space. Thread parallelism (such as OpenMP or Pthreads) can provide additional parallelism
More informationUsing and Administering
IBM LoadLeeler Version 5 Release 1 Using and Administering SC23-6792-04 IBM LoadLeeler Version 5 Release 1 Using and Administering SC23-6792-04 Note Before using this information and the product it supports,
More informationXeon Phi Native Mode - Sharpen Exercise
Xeon Phi Native Mode - Sharpen Exercise Fiona Reid, Andrew Turner, Dominic Sloan-Murphy, David Henty, Adrian Jackson Contents June 19, 2015 1 Aims 1 2 Introduction 1 3 Instructions 2 3.1 Log into yellowxx
More informationImplementation of Parallelization
Implementation of Parallelization OpenMP, PThreads and MPI Jascha Schewtschenko Institute of Cosmology and Gravitation, University of Portsmouth May 9, 2018 JAS (ICG, Portsmouth) Implementation of Parallelization
More informationbwunicluster Tutorial Access, Data Transfer, Compiling, Modulefiles, Batch Jobs
bwunicluster Tutorial Access, Data Transfer, Compiling, Modulefiles, Batch Jobs Frauke Bösert, SCC, KIT 1 Material: Slides & Scripts https://indico.scc.kit.edu/indico/event/263/ @bwunicluster/forhlr I/ForHLR
More informationSlurm and Abel job scripts. Katerina Michalickova The Research Computing Services Group SUF/USIT November 13, 2013
Slurm and Abel job scripts Katerina Michalickova The Research Computing Services Group SUF/USIT November 13, 2013 Abel in numbers Nodes - 600+ Cores - 10000+ (1 node->2 processors->16 cores) Total memory
More informationMulticore Performance and Tools. Part 1: Topology, affinity, clock speed
Multicore Performance and Tools Part 1: Topology, affinity, clock speed Tools for Node-level Performance Engineering Gather Node Information hwloc, likwid-topology, likwid-powermeter Affinity control and
More informationIntroduction to Discovery.
Introduction to Discovery http://discovery.dartmouth.edu The Discovery Cluster 2 Agenda What is a cluster and why use it Overview of computer hardware in cluster Help Available to Discovery Users Logging
More informationShell Scripting. With Applications to HPC. Edmund Sumbar Copyright 2007 University of Alberta. All rights reserved
AICT High Performance Computing Workshop With Applications to HPC Edmund Sumbar research.support@ualberta.ca Copyright 2007 University of Alberta. All rights reserved High performance computing environment
More informationSlurm and Abel job scripts. Katerina Michalickova The Research Computing Services Group SUF/USIT October 23, 2012
Slurm and Abel job scripts Katerina Michalickova The Research Computing Services Group SUF/USIT October 23, 2012 Abel in numbers Nodes - 600+ Cores - 10000+ (1 node->2 processors->16 cores) Total memory
More informationIntroduction to Discovery.
Introduction to Discovery http://discovery.dartmouth.edu The Discovery Cluster 2 Agenda What is a cluster and why use it Overview of computer hardware in cluster Help Available to Discovery Users Logging
More informationGrid Examples. Steve Gallo Center for Computational Research University at Buffalo
Grid Examples Steve Gallo Center for Computational Research University at Buffalo Examples COBALT (Computational Fluid Dynamics) Ercan Dumlupinar, Syracyse University Aerodynamic loads on helicopter rotors
More informationPorting Applications to Blue Gene/P
Porting Applications to Blue Gene/P Dr. Christoph Pospiech pospiech@de.ibm.com 05/17/2010 Agenda What beast is this? Compile - link go! MPI subtleties Help! It doesn't work (the way I want)! Blue Gene/P
More informationUser Guide of High Performance Computing Cluster in School of Physics
User Guide of High Performance Computing Cluster in School of Physics Prepared by Sue Yang (xue.yang@sydney.edu.au) This document aims at helping users to quickly log into the cluster, set up the software
More informationPractical Introduction to Message-Passing Interface (MPI)
1 Outline of the workshop 2 Practical Introduction to Message-Passing Interface (MPI) Bart Oldeman, Calcul Québec McGill HPC Bart.Oldeman@mcgill.ca Theoretical / practical introduction Parallelizing your
More informationIntroduction to HPC Using zcluster at GACRC On-Class GENE 4220
Introduction to HPC Using zcluster at GACRC On-Class GENE 4220 Georgia Advanced Computing Resource Center University of Georgia Suchitra Pakala pakala@uga.edu Slides courtesy: Zhoufei Hou 1 OVERVIEW GACRC
More informationSlurm basics. Summer Kickstart June slide 1 of 49
Slurm basics Summer Kickstart 2017 June 2017 slide 1 of 49 Triton layers Triton is a powerful but complex machine. You have to consider: Connecting (ssh) Data storage (filesystems and Lustre) Resource
More informationSupercomputing environment TMA4280 Introduction to Supercomputing
Supercomputing environment TMA4280 Introduction to Supercomputing NTNU, IMF February 21. 2018 1 Supercomputing environment Supercomputers use UNIX-type operating systems. Predominantly Linux. Using a shell
More informationDebugging Intel Xeon Phi KNC Tutorial
Debugging Intel Xeon Phi KNC Tutorial Last revised on: 10/7/16 07:37 Overview: The Intel Xeon Phi Coprocessor 2 Debug Library Requirements 2 Debugging Host-Side Applications that Use the Intel Offload
More informationIntroduction to Discovery.
Introduction to Discovery http://discovery.dartmouth.edu March 2014 The Discovery Cluster 2 Agenda Resource overview Logging on to the cluster with ssh Transferring files to and from the cluster The Environment
More informationKISTI TACHYON2 SYSTEM Quick User Guide
KISTI TACHYON2 SYSTEM Quick User Guide Ver. 2.4 2017. Feb. SupercomputingCenter 1. TACHYON 2 System Overview Section Specs Model SUN Blade 6275 CPU Intel Xeon X5570 2.93GHz(Nehalem) Nodes 3,200 total Cores
More informationAn Introduction to Cluster Computing Using Newton
An Introduction to Cluster Computing Using Newton Jason Harris and Dylan Storey March 25th, 2014 Jason Harris and Dylan Storey Introduction to Cluster Computing March 25th, 2014 1 / 26 Workshop design.
More informationbwunicluster Tutorial Access, Data Transfer, Compiling, Modulefiles, Batch Jobs
bwunicluster Tutorial Access, Data Transfer, Compiling, Modulefiles, Batch Jobs Frauke Bösert, SCC, KIT 1 Material: Slides & Scripts https://indico.scc.kit.edu/indico/event/263/ @bwunicluster/forhlr I/ForHLR
More information