Practical Introduction to



Bart Oldeman, Calcul Québec McGill HPC
Bart.Oldeman@mcgill.ca

Outline of the workshop
- What is ScaleMP? When do we need it?
- How do we run codes on the ScaleMP node on the ScaleMP Guillimin cluster?
- How to run programs efficiently on ScaleMP?

What is the ScaleMP node
- Collection of nodes that behaves like a single computer with a large shared memory.
- Consists of 11 nodes with 12 cores per node and 8GB of memory per core.
- Because of software running on these nodes, ScaleMP appears like a single node with 132 cores and 925GB of memory (the rest is overhead).
- Therefore, this system is particularly useful for researchers who require access to large amounts of memory.

[Figure: slaves with USB sticks; master node]
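The node figures quoted above can be checked with a few lines of integer arithmetic. This is only a sketch: the variable names are illustrative, and the overhead percentage is derived from the quoted 925GB rather than taken from ScaleMP documentation.

```shell
#!/usr/bin/env bash
# Figures quoted in the slides (variable names are hypothetical).
nodes=11
cores_per_node=12
gb_per_core=8
visible_gb=925          # memory the single virtual node reports

total_cores=$((nodes * cores_per_node))          # cores seen by the OS
installed_gb=$((total_cores * gb_per_core))      # physical memory in all boards

# Overhead in tenths of a percent, using integer arithmetic only.
overhead_tenths=$(( (installed_gb - visible_gb) * 1000 / installed_gb ))

# Memory available per core after overhead, rounded down.
usable_per_core=$((visible_gb / total_cores))

printf 'cores: %d\n' "$total_cores"
printf 'installed memory: %d GB\n' "$installed_gb"
printf 'overhead: %d.%d%%\n' $((overhead_tenths / 10)) $((overhead_tenths % 10))
printf 'usable per core: %d GB\n' "$usable_per_core"
```

The per-core result is the 7GB figure used later when reserving cores for a job.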

What is ScaleMP
- ScaleMP, Inc. is a software vendor.
- The software is called vSMP Foundation, where vSMP stands for virtual symmetric multiprocessing, a form of hardware virtualization.
- Computer memory design: NUMA, which stands for Non-Uniform Memory Access.
- Strategic portions of common memory are cached on individual physical nodes (or "boards").

When to use the ScaleMP node
- If the memory requirements of a shared-memory threaded (OpenMP/PThreads) or serial (single-core) job are too high for the lm (large memory) queue (72 GB, 12 cores per node).
- If you want to use more than 12 cores for a shared-memory job that does not require tight synchronization between threads.
- If the per-process memory requirements of an MPI job are higher than 72 GB. This is not very common, since MPI uses distributed memory and such jobs can often be run effectively on the hb queue with 2 GB per core.

Exercise 1: Log in to Guillimin, setting up the environment
1) Log in to Guillimin:
    ssh username@guillimin.clumeq.ca
2) Copy all files to your home directory:
    guillimin> cp /software/workshop/scalemp/* .

Exercise 2: How to submit a job to scalemp
On the Guillimin cluster, we use the batch system to submit jobs to ScaleMP. Example: hello.pbs
1) View the file hello.pbs:
    guillimin> cat hello.pbs
    #PBS -l nodes=1:ppn=1
    #PBS -N hello
    echo "Hello from ScaleMP. Memory usage:" > hello.out
    free -g >> hello.out
2) Submit your job:
    guillimin> msub -q scalemp hello.pbs

Exercise 2: How to submit a job to scalemp (continued)
3) Check the job status:
    guillimin> showq -u <username>
4) Check the output (hello.out).

Processor affinity/CPU pinning
- Binds threads to physical cores.
- Without affinity, the process may be spread out across multiple nodes, have its threads migrated by the OS, and so on.
- Critical to performance on ScaleMP: use as few physical nodes as possible.
- Programs with extensive synchronization between threads may not even run on multiple nodes!
- Controlled using Linux utilities (taskset and numactl), the ScaleMP utility numabind, and environment variables (KMP_AFFINITY and GOMP_CPU_AFFINITY).

Affinity: some benchmarks
Example: AUTO-07p (cmvl.cs.concordia.ca/auto)

Job type                                        Time (sec.)   Speed-up
Serial time without I/O
scalemp board (12 cores), affinity compact
cores on scalemp, affinity scatter
hb nodes (48 cores), hybrid MPI/OpenMP
scalemp boards (48 cores), affinity compact
cores on scalemp, affinity scatter

Too much synchronization in this program for effective multi-board scalemp use.

OpenMP jobs
- Reserve the number of cores (ppn) according to memory and core usage.
- Memory usage: 7GB per core (8GB, 12.5% overhead).
- Example: a 12-core program uses 100 GB: reserve max(100/7,12)=15 cores (rounded up).
- Example: a 24-core program uses 100 GB: reserve max(100/7,24)=24 cores.
- Recommended to use the Intel compiler and the KMP_AFFINITY environment variable.
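The core-reservation rule above, max(memory/7, cores) rounded up, can be sketched as a small shell function. The name reserve_ppn and its argument order are hypothetical; the 7GB-per-core default is the figure quoted in the slides.

```shell
#!/usr/bin/env bash
# reserve_ppn MEM_GB THREADS [GB_PER_CORE]
# Hypothetical helper: prints the ppn value to request so that the job
# gets enough cores for its threads AND enough memory (7 GB per core
# by default, as quoted in the slides).
reserve_ppn() {
    local mem_gb=$1
    local threads=$2
    local gb_per_core=${3:-7}

    # Ceiling division: cores needed to cover the memory requirement.
    local by_memory=$(( (mem_gb + gb_per_core - 1) / gb_per_core ))

    if (( by_memory > threads )); then
        echo "$by_memory"
    else
        echo "$threads"
    fi
}

reserve_ppn 100 12   # 12 threads, 100 GB: memory dominates
reserve_ppn 100 24   # 24 threads, 100 GB: thread count dominates
```

The two calls reproduce the slide examples (15 and 24 cores); the result would go into the `#PBS -l nodes=1:ppn=...` line of the job script.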

OpenMP jobs (continued)
- Recommended: compact spacing of 12 threads on one board:
    export KMP_AFFINITY=compact,verbose,0,`numabind --offset 12`
- Compact spacing of 24 threads spread evenly across 2 boards (ppn=24):
    export KMP_AFFINITY=compact,verbose,0,`numabind --offset 24`
- See products/documentation/studio/composer/en-us/2011update/compiler_c/optaps/common/optaps_openmp_thread_affinity.htm for more advanced examples.

Exercise 3: OpenMP example
Example: matrix multiplication.
    module add ifort icc
    icc -openmp openmp-mm.c -o openmp-mm

Job script openmp.pbs:
    #PBS -l nodes=1:ppn=12
    #PBS -N openmp-mm
    # Hoard library: improve dynamic memory performance.
    export LD_PRELOAD=$LD_PRELOAD:/usr/lib/libhoard.so
    # Add Intel libraries and numabind to PATH
    module add ifort icc ScaleMP/numabind
    # Dynamic binding of OpenMP threads using numabind.
    export KMP_AFFINITY=compact,verbose,0,`numabind --offset 12`
    export OMP_NUM_THREADS=12
    /usr/bin/time ./openmp-mm &> openmp.out

Exercise 4: OpenMP example
Change the export KMP_AFFINITY line to
    export KMP_AFFINITY=scatter,verbose,0,0
and see the difference.
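To switch between the compact and scatter settings without retyping the whole value, the KMP_AFFINITY string could be assembled by a small helper. This is only a sketch: the function name kmp_affinity is hypothetical, and it simply reproduces the two string layouts used in the slides rather than the full Intel syntax.

```shell
#!/usr/bin/env bash
# kmp_affinity MODE [OFFSET]
# Hypothetical helper: prints a KMP_AFFINITY value in the layout used in
# the slides, "<mode>,verbose,0,<offset>". Only the two modes that
# appear in the slides are accepted.
kmp_affinity() {
    local mode=$1
    local offset=${2:-0}

    case "$mode" in
        compact|scatter) ;;
        *)
            echo "kmp_affinity: unknown mode '$mode'" >&2
            return 1
            ;;
    esac

    echo "${mode},verbose,0,${offset}"
}

# On the ScaleMP node the offset would come from numabind, for example:
#   export KMP_AFFINITY=$(kmp_affinity compact "$(numabind --offset 12)")
# For the scatter experiment in Exercise 4:
#   export KMP_AFFINITY=$(kmp_affinity scatter 0)
kmp_affinity compact 12
kmp_affinity scatter 0
```

Because verbose is kept in both settings, the Intel runtime reports the resulting thread bindings, which makes the two runs easy to compare.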

Serial jobs
- Still need to reserve multiple cores according to memory usage!
- Use the taskset utility as follows:
    taskset -c `numabind --offset 1` ./serialprog

Exercise 5: serial example
    module add ifort icc
    icc serial-mm.c -o serial-mm

Job script serial.pbs:
    #PBS -l nodes=1:ppn=12
    #PBS -N serial-mm
    # Add Intel libraries and numabind to PATH
    module add ifort icc ScaleMP/numabind
    # Bind thread to one core using numabind and taskset.
    taskset -c `numabind --offset 1` /usr/bin/time ./serial-mm &> serial.out

PThreads jobs
Like OpenMP, but the environment variable KMP_AFFINITY is ignored. Two alternatives:
1. Use
    first=`numabind --offset 12`
    last=$(($first+11))
    taskset -c $first-$last ./pthreadprog
2. Adjust affinity when the program is already running. Create the file myconfig, on one line:
    name=pthreadprog pattern=pthreadprog verbose=0 process_allocation=multi task_affinity=cpu rule=rule-procgroup.so flags=ignore_idle
   then use numabind --config myconfig on the running program.

Exercise 6: PThreads example
    module add ifort icc
    icc -pthread ptest.c -o ptest

Job script ptest.pbs:
    #PBS -N ptest
    # Add Intel libraries and numabind to PATH
    module add ifort icc ScaleMP/numabind
    # Bind threads to contiguous cores using numabind and taskset.
    first=`numabind --offset 12`
    last=$(($first+11))
    taskset -c $first-$last /usr/bin/time ./ptest &> ptest.out &
    # List each thread (LWP) with the processor (psr) it is running on.
    ps -eLo pid,lwp,time,ucmd,psr | grep ptest >> ptestps.out 2>&1
    wait
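The first/last computation used for PThreads jobs generalizes to any thread count. A minimal sketch follows; the function name core_range is hypothetical, and it assumes the first core index has already been obtained (on ScaleMP, from numabind).

```shell
#!/usr/bin/env bash
# core_range FIRST COUNT
# Hypothetical helper: prints a contiguous core range "FIRST-LAST" of
# COUNT cores, in the form taskset -c accepts.
core_range() {
    local first=$1
    local count=$2
    local last=$(( first + count - 1 ))
    echo "${first}-${last}"
}

# On the ScaleMP node this would be combined with numabind, for example:
#   first=$(numabind --offset 12)
#   taskset -c "$(core_range "$first" 12)" ./pthreadprog
core_range 24 12
core_range 0 24
```

With a first core of 24 and 12 threads this gives the same range as the `last=$(($first+11))` line in the job script.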

Exercise 7: PThreads example
Modify ptest.pbs to use numabind --config myconfig, with ps output before and after.

MPI jobs
- Only for special cases, where individual ranks are too big for the lm queue. Otherwise use lm with (for example) ppn=12 and mpiexec -npernode 1 for 72GB per rank.
- Use the specially tuned version of MPICH2, using module add ScaleMP/mpich2.
- Specify affinity using the VSMP_PLACEMENT environment variable. For example: PACKED (contiguous), SPREAD^2^12 (on two nodes).
- Can also use precompiled OpenMPI and MPICH1 executables, using a wrapper (see /software/scalemp/examples/{openmpi,mpich1}).

Exercise 8: MPI example
    module add ifort icc ScaleMP/mpich2
    mpicc -cc=icc mpi-mm.c -o mpi-mm

Job script mpi.pbs:
    #PBS -l nodes=1:ppn=12
    #PBS -N mpi-mm
    # Add Intel libraries, numabind, and mpich2 to PATH
    module add ifort icc ScaleMP/numabind ScaleMP/mpich2
    # Run the two processes on two physical nodes
    export VSMP_PLACEMENT=SPREAD^2^12
    export VSMP_VERBOSE=yes
    /usr/bin/time mpiexec -n 2 ./mpi-mm &> mpi.out

Further readings
- ScaleMP: Documentation: guillimin-getting-started/scalemp-system
- More examples in the Guillimin directory: /software/scalemp/examples
- Other cases: MKL, Throughput.
