Lab: Scientific Computing - Tsunami Simulation
Session 4: Optimization and OMP
Sebastian Rettenberger, Michael Bader
23.11.15
Linux-Cluster

[Figure: timeline of the LRZ Linux cluster, 1999-2015]
https://www.lrz.de/services/compute/linux-cluster/lx_timeline/
Hardware Segment   #Nodes   #Cores   Aggregate peak performance (TFlop/s)   Aggregate memory (TByte)
CoolMUC2              252     7056                                     260                       16.1
MPP                   178     2848                                    22.7                        2.8
UV                      2     2080                                    20.0                        6.0
Intel Haswell EP

[Figure: Haswell EP architecture overview]
http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/4
Rules

1. Do not use login nodes to run your application.
2. Do not start more than one interactive job.
3. Do not put large input/output files in your home directory.
Some Basics

- X forwarding: ssh -Y user@lxlogin5.lrz.de
- Module system: module load, list, unload, info, avail
- Interactive compute job:
    salloc --ntasks=1 --cpus-per-task=28 --time=00:15:00
    srun ./executable ARGS...
- Batch compute jobs: squeue, sbatch, sinfo, sview (see the sketch below)
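A minimal batch-script sketch matching the interactive allocation above. The job name and output file are illustrative placeholders, and LRZ may require additional cluster or partition options; check the LRZ documentation before submitting.

    #!/bin/bash
    #SBATCH --job-name=tsunami         # illustrative name, shown by squeue
    #SBATCH --output=tsunami.%j.out    # illustrative output file; %j = job ID
    #SBATCH --ntasks=1                 # one process
    #SBATCH --cpus-per-task=28         # all cores of one CoolMUC2 node
    #SBATCH --time=00:15:00            # wall-clock limit

    export OMP_NUM_THREADS=28          # use all allocated cores
    srun ./executable ARGS...

Submit with sbatch script.sh and monitor with squeue.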
VTune

[Screenshots: starting a VTune analysis and inspecting the results]
What is OpenMP?

"OpenMP is a specification for a set of compiler directives, library routines, and environment variables that can be used to specify high-level parallelism in Fortran and C/C++ programs."
http://openmp.org/openmp-faq.html#whatis
Shared-Memory Parallelism

- Transparent programming view: cores access the same memory
- In hardware: be aware of the layout, e.g. NUMA; first-touch policy

[Figures: Sandy Bridge CPU, http://www.bit-tech.net/hardware/cpus/2011/01/03/intel-sandybridge-review/1; Knights Corner and Knights Landing, http://www.zdnet.com/sc13-intel-reveals-knights-landing-high-performance-cpu-7000023393/]
https://computing.llnl.gov/tutorials/openmp/
Fork-Join Model

[Figure: fork-join execution model]
https://computing.llnl.gov/tutorials/openmp/#programmingmodel
A Simple Loop

    #include <omp.h>

    // some initialization

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            // do some work
        }
    }

OpenMP - Parallel programming on shared memory systems:
https://www.lrz.de/services/software/parallel/openmp/
OMP: An Overview

- Compiler directives: #pragma omp ... [clause[[,] clause]...]
  - Work sharing: #pragma omp for ..., and many more
  - Synchronization: #pragma omp barrier, and many more
  - Data scope clauses: shared, private, firstprivate, reduction, ...
- Library routines: omp_get_num_threads(), omp_get_thread_num(), ...
Compilation and Execution

Compile with the additional compiler flag -openmp (-fopenmp in case of gcc):

    icc -openmp -o my_algorithm alg.c

Run with the desired number of threads:

    export OMP_NUM_THREADS=<number of threads>
    ./my_algorithm
Parallel Region

    #pragma omp parallel [clause[[,] clause]...]
        structured block

Code within a parallel region is executed by all threads:

    #include <omp.h>

    // more initialization ...

    #pragma omp parallel private(size, rank)
    {
        size = omp_get_num_threads();
        rank = omp_get_thread_num();
        printf("Hello World! (Thread %d of %d)", rank, size);
    }
Work Sharing: for

    #pragma omp for [clause[[,] clause]...]
        for-loop

- Used within a parallel region, directly in front of a for loop
- Iterations are scheduled across the different threads
- The schedule clause determines how iterations are mapped to threads (see the sketch below):
  - schedule(static[, chunksize]): default chunksize is #iterations divided by #threads
  - schedule(dynamic[, chunksize]): default chunksize is 1
  - schedule(runtime): scheduling is given by the environment variable OMP_SCHEDULE
- Implicit synchronization at the end of the for loop (can be disabled with the nowait clause)
- Shortcut possible: #pragma omp parallel for

https://computing.llnl.gov/tutorials/openmp/
Intel, Load Balance and Parallel Performance, https://software.intel.com/en-us/articles/load-balance-and-parallel-performance
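A minimal sketch of the schedule clause in action, assuming a hypothetical work(i) whose cost grows with i: with schedule(static), the thread owning the last chunk would dominate the runtime, while schedule(dynamic, 4) hands out small chunks on demand. (The reduction clause is explained a few slides below.)

    #include <math.h>
    #include <stdio.h>

    // hypothetical workload whose cost grows with i
    static double work(int i) {
        double x = 0.0;
        for (int k = 0; k < 1000 * i; k++)
            x += sin((double)k);
        return x;
    }

    int main() {
        double sum = 0.0;

        // chunks of 4 iterations are handed out on demand, so threads
        // that finish cheap iterations early pick up more work
        #pragma omp parallel for schedule(dynamic, 4) reduction(+: sum)
        for (int i = 0; i < 100; i++)
            sum += work(i);

        printf("sum = %f\n", sum);
        return 0;
    }

Compile with icc -openmp (or gcc -fopenmp, linking -lm).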
Work Sharing: task

    #pragma omp task [clause[[,] clause]...]
        structured block

where clause can be ...
- final(expression): if expression is true, the task is executed sequentially; no recursive task generation
- untied: the execution of a task is not tied to one single thread
- shared, firstprivate, private, [...]

#pragma omp taskwait: waits for the completion of child tasks (see the sketch below)

Intel, Load Balance and Parallel Performance, https://software.intel.com/en-us/articles/load-balance-and-parallel-performance
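A minimal task sketch using the classic (deliberately naive) recursive Fibonacci; the cutoff n < 20 in final() is an arbitrary illustrative value that stops task creation where the overhead would dominate. (single is explained on the next slide.)

    #include <omp.h>
    #include <stdio.h>

    long fib(int n) {
        if (n < 2) return n;
        long a, b;
        // a and b live on this function's stack, so they must be
        // declared shared for the child tasks to write into them
        #pragma omp task shared(a) final(n < 20)
        a = fib(n - 1);
        #pragma omp task shared(b) final(n < 20)
        b = fib(n - 2);
        #pragma omp taskwait   // wait for both children before summing
        return a + b;
    }

    int main() {
        long result;
        #pragma omp parallel
        #pragma omp single     // one thread spawns the root of the task tree
        result = fib(30);
        printf("fib(30) = %ld\n", result);
        return 0;
    }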
Work Sharing: single

    #pragma omp single [clause[[,] clause]...]
        structured block

- Only one (arbitrary) thread executes the structured block
- Implicit synchronization at the end

    #pragma omp master
        structured block

- Only the master thread executes the structured block
- NO synchronization at the end (see the sketch below)

https://computing.llnl.gov/tutorials/openmp/
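A short sketch contrasting the two constructs; which thread executes the single block is implementation-dependent.

    #include <omp.h>
    #include <stdio.h>

    int main() {
        #pragma omp parallel
        {
            // exactly one (arbitrary) thread prints this; the others
            // wait at the implicit barrier at the end of single
            #pragma omp single
            printf("single executed by thread %d\n", omp_get_thread_num());

            // only thread 0 executes this; no barrier is implied,
            // so the other threads continue immediately
            #pragma omp master
            printf("master is thread %d\n", omp_get_thread_num());
        }
        return 0;
    }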
Synchronization

- #pragma omp barrier: blocks execution until all threads have reached the barrier
- #pragma omp critical
      structured block
  Only one thread at a time can execute the structured block encapsulated by critical.
  Warning: use carefully; can definitely kill performance! (see the sketch below)
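An illustrative sketch of critical guarding a shared maximum: it is correct, but the loop body is fully serialized, which is why a reduction (next slide) is preferable where one fits.

    #include <stdio.h>

    int main() {
        double data[1000], max_val = 0.0;
        for (int i = 0; i < 1000; i++)
            data[i] = (i * 37) % 1000;   // arbitrary sample values

        #pragma omp parallel for
        for (int i = 0; i < 1000; i++) {
            // only one thread at a time may compare and update
            #pragma omp critical
            {
                if (data[i] > max_val)
                    max_val = data[i];
            }
        }
        printf("max = %f\n", max_val);
        return 0;
    }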
Reduction

    reduction(operator: list)

- Executes a reduction of the variables in list using operator
- Available operators: +, *, -, &&, ||; since OpenMP 3.1 also min, max

    #pragma omp parallel for private(r), reduction(+: sum)
    for (i = 0; i < n; i++) {
        r = compute_r(i);
        sum = sum + r;
    }
Scopes

- private(list): declares the variables in list as private for each thread (no copy!)
- shared(list): variables in list are used by all threads (race conditions are possible!); write accesses have to be handled by the programmer
- firstprivate(list): private variables, initialized with the last valid value from before the parallel region
- lastprivate(list): after the parallel execution, the serial part reuses the value from the last modification (see the sketch below)
- ...
- The default data scope is shared, but exceptions exist; be precise!
  - Local variables are always private
  - Loop variables of for loops are private
  - Changes between OpenMP 2 and OpenMP 3!
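A small sketch of firstprivate and lastprivate, with illustrative variable names:

    #include <stdio.h>

    int main() {
        int offset = 10;   // read inside the loop  -> firstprivate
        int last = 0;      // needed after the loop -> lastprivate

        #pragma omp parallel for firstprivate(offset) lastprivate(last)
        for (int i = 0; i < 100; i++) {
            // every thread starts with its own offset, initialized to 10;
            // after the loop, last holds the value of iteration i == 99
            last = i + offset;
        }
        printf("last = %d\n", last);   // prints 109
        return 0;
    }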
Fill the Gaps

    k = compute_k();

    #pragma omp parallel for private(?), shared(?), firstprivate(?)
    for (i = 0; i < n; i++) {
        k = compute_my_k(i, k);
        r = compute_r(i, k, h);
        sum = sum + r;
    }
NUMA

- Non-uniform memory access (NUMA): not all cores access all memory at the same speed
- Pinning: use close-to-core memory as far as possible
- Linux first-touch policy: not the malloc but the first (write) access determines where a page is placed (see the sketch below)
- numactl has some options

Georg Hager's Blog: http://blogs.fau.de/hager/
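A sketch of NUMA-aware initialization exploiting first touch; compute() is a hypothetical per-element function. The point is that the initialization loop uses the same static schedule as the compute loop, so each page is first written, and therefore placed, near the thread that later works on it.

    #include <stdlib.h>

    static double compute(int i) { return 1.0 / (i + 1); }  // hypothetical

    double *allocate_and_fill(int n) {
        // malloc only reserves virtual address space; a physical page is
        // placed on the NUMA node of the thread that touches it first
        double *a = malloc(n * sizeof(double));

        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++)
            a[i] = 0.0;          // first touch in parallel

        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++)
            a[i] = compute(i);   // same schedule -> NUMA-local accesses

        return a;
    }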