Parallel Programming: OpenMP
Nils Moschüring, PhD Student (LMU)
Outline
1 Overview
  - What is parallel software development?
  - Why do we need parallel computation?
  - Problems which benefit from parallelization
2 OpenMP - Basics
  - Basic properties
  - Programming Model
  - Basic Syntax
3 OpenMP - Advanced
  - Clauses
  - Directives
  - Synchronization Constructs
4 Pros and Cons
Acknowledgments
This presentation has been heavily influenced by a lecture series organized by Rolf Rabenseifner from the HLRS (Höchstleistungsrechenzentrum Stuttgart).
- Currently available courses: https://fs.hlrs.de/projects/par/events/2013/parallel_prog_2013/
- Overview of events: http://www.hlrs.de/events
- The appropriate standards: https://fs.hlrs.de/projects/par/par_prog_ws/standards/readme.html
These are highly recommended!
What is parallel software development?
Taking advantage of one or more of the following concepts:
- Pipelining
- Vector computing
- Functional parallelism
- Multi-core (MIMD)
- Hyper-Threading
- ccNUMA (cache-coherent Non-Uniform Memory Access)
- Array processing (SIMD, MMX, SSE2)
Pipelining
[Diagram: two instructions moving through the four pipeline stages, offset by one clock cycle]
- A: IF - instruction fetch
- B: ID - instruction decoding
- C: EX - execution
- D: WB - write back
Pipelining
[Diagram: three instructions in flight simultaneously, one pipeline stage apart]
Problems:
- an instruction depends on the outcome of a previous instruction (branch prediction, pipeline flushing)
- resource conflicts
- data conflicts
Why do we need parallel computation?
- Moore's Law: the number of transistors keeps increasing, but clock frequencies no longer do
- increased memory demands
- one core is too slow
Problems which benefit from parallelization
- matrix-vector multiplication (see the sketch below)
- solving systems of linear equations
- grid-based algorithms
- and many more!
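As a preview of the constructs introduced later, a minimal sketch of the first example: each thread computes a disjoint set of rows, so no two threads write the same result element. The matrix size and test data are illustrative, not from the slides.

    #include <stdio.h>

    #define N 4

    /* y = A * x, with the outer loop distributed across threads */
    int main(void)
    {
        double A[N][N], x[N], y[N];

        /* fill A and x with some test data */
        for (int i = 0; i < N; i++) {
            x[i] = 1.0;
            for (int j = 0; j < N; j++)
                A[i][j] = i + j;
        }

        /* each y[i] is written by exactly one thread: no race */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            y[i] = 0.0;
            for (int j = 0; j < N; j++)
                y[i] += A[i][j] * x[j];
        }

        for (int i = 0; i < N; i++)
            printf("y[%d] = %f\n", i, y[i]);
        return 0;
    }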
Basic properties
- allows incremental parallelization
- uses mainly preprocessor directives
- easiest approach to multi-threaded programming (shared memory systems only)
Basic properties
Focus on parallelizable loops.

Serial program:

    int main(int argc, char **argv)
    {
        double res[1000];
        for (int i = 0; i < 1000; i++) {
            compl_calc(res[i]);
        }
    }

Parallel program:

    int main(int argc, char **argv)
    {
        double res[1000];
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++) {
            compl_calc(res[i]);
        }
    }
Basic properties
- compile with: gcc -fopenmp test.c
- to set the maximum number of threads, set the environment variable OMP_NUM_THREADS to the desired value, e.g. (bash): export OMP_NUM_THREADS=16
- and that's it!
Programming Model
- only for shared memory systems (no multiple processes)
- workload is distributed among the available threads
- variables can be shared among all threads or duplicated for each thread
- threads communicate by sharing variables
- high risk of race conditions (the default behavior is shared for all variables!), as in the sketch below
- synchronization constructs are available to control this
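A minimal sketch of such a race condition (a hypothetical example; compile with gcc -fopenmp). The unsynchronized loop typically loses updates; the reduction clause, introduced later, avoids them:

    #include <stdio.h>

    int main(void)
    {
        int counter = 0;

        /* RACE: counter is shared by default, and ++ is not atomic;
           the final value is typically less than 2000000 */
        #pragma omp parallel for
        for (int i = 0; i < 2000000; i++)
            counter++;
        printf("with race:    counter = %d\n", counter);

        counter = 0;
        /* safe alternative: OpenMP combines per-thread private copies */
        #pragma omp parallel for reduction(+:counter)
        for (int i = 0; i < 2000000; i++)
            counter++;
        printf("with reduction: counter = %d\n", counter);
        return 0;
    }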
Execution model
[Diagram: execution alternates between sequential and parallel phases over time; the number of threads grows in parallel phases and drops back to one in sequential phases]
Execution model
- so-called fork-join model
- starts as a process with a single thread (master thread)
- when a parallel pragma is encountered: fork into a team of threads
- on completion of the pragma: synchronization, implicit barrier
- continue with the master thread
Parallel regions
- basic construct
- starts multiple threads
- each thread executes the same code redundantly
Syntax:

    #pragma omp parallel [clause [[,] clause] ...] new-line
    structured-block

Clause can be:
- private (list)
- shared (list)
- ...
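A minimal sketch combining both clauses (variable names are illustrative; compile with gcc -fopenmp):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int total = 0;    /* one copy, visible to all threads */
        int scratch = 42; /* each thread gets its own, uninitialized copy */

        #pragma omp parallel private(scratch) shared(total)
        {
            scratch = omp_get_thread_num(); /* thread-local value */
            printf("thread %d running\n", scratch);
        } /* implicit barrier here */

        return 0;
    }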
Directives
- case sensitive
- change behaviour inside parallel regions
Syntax:

    #pragma omp directive [clause [[,] clause] ...] new-line
Library functions
A small set of library functions is available to control OpenMP. Usage:

    #include <stdio.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    int main(int argc, char **argv)
    {
    #ifdef _OPENMP
        printf("nr of procs = %d\n", omp_get_num_procs());
    #endif
    }
Library functions
More available functions:
- void omp_set_num_threads(int): sets the number of threads
- int omp_get_thread_num(void): returns the current thread's number
- int omp_in_parallel(void): detects whether execution is inside a parallel region
- ...
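A small usage sketch of these three functions (illustrative; compile with gcc -fopenmp):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_set_num_threads(4); /* request 4 threads for parallel regions */

        /* outside a parallel region: omp_in_parallel() returns 0 */
        printf("serial, in parallel? %d\n", omp_in_parallel());

        #pragma omp parallel
        {
            /* inside: omp_in_parallel() returns 1 */
            printf("thread %d, in parallel? %d\n",
                   omp_get_thread_num(), omp_in_parallel());
        }
        return 0;
    }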
Data scope clauses
- private (list): declares the variables in list to be private to each thread
- shared (list): declares the variables in list to be shared among all threads
The default for all variables is shared, except:
- local variables in a parallel region are private
- the loop control variable is private
- ...
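A sketch of these default rules (illustrative; the critical construct used to merge the partial sums is introduced later):

    #include <stdio.h>

    int main(void)
    {
        int n = 100;  /* declared before the region: shared by default */
        long sum = 0; /* also shared by default: beware of races */

        #pragma omp parallel
        {
            long local = 0; /* declared inside the region: private */

            #pragma omp for
            for (int i = 0; i < n; i++) /* loop control variable: private */
                local += i;

            #pragma omp critical
            sum += local; /* one thread at a time merges its partial sum */
        }

        printf("sum = %ld\n", sum); /* 0 + 1 + ... + 99 = 4950 */
        return 0;
    }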
Reduction clauses
Reduction is the process of combining data from multiple threads into one result, or conversely distributing one value to multiple threads. OpenMP offers clauses to accomplish this:
- firstprivate (var): initializes each private copy with the value of var from the nonparallel region
- lastprivate (var): copies the last value of var into the variable of the nonparallel region (last iteration for loops, last section for sections/tasks)
- reduction (operator:list): performs a reduction on the variables in list (must be shared in the enclosing context) with the operator operator; operator can be +, *, -, &, ^, |, &&, ||, max, min; at the end of the reduction, the shared variable is updated from each thread's private copy using the operator
Reduction clauses: Example

    double result = 0.;
    #pragma omp parallel for reduction(+:result)
    for (int i = 0; i < 5; i++) {
        double val = i * i;
        result += val;
    } /*omp end parallel for*/
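A similar sketch for firstprivate and lastprivate (variable names are illustrative):

    #include <stdio.h>

    int main(void)
    {
        int offset = 10; /* value carried into each thread's private copy */
        int last = -1;   /* receives the value from the last iteration */

        #pragma omp parallel for firstprivate(offset) lastprivate(last)
        for (int i = 0; i < 8; i++) {
            last = i + offset; /* sequentially last iteration: i == 7 */
        }

        printf("last = %d\n", last); /* prints 17 */
        return 0;
    }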
Directives
Properties:
- divide the enclosed code among threads
- must be inside a parallel region
- no implicit synchronization on entry
- implicit synchronization on exit (the nowait clause gets rid of this)
Available directives:
- sections: explicitly define different code for different threads
- for: distribute the iterations of the following loop onto different threads
- single: block is executed by a single thread only (reduces fork-join overhead)
- task: generates a new task from the following code, which will be distributed to one free thread
Directives - sections

    int main(int argc, char **argv)
    {
        #pragma omp parallel
        {
            #pragma omp sections
            {
                #pragma omp section
                { fa(); }
                #pragma omp section
                { fb(); }
            } /*omp end sections*/
        } /*omp end parallel*/
    }

Executes fa() and fb() in parallel.
Directives - for

    int a[20], k;
    #pragma omp parallel private(k)
    {
        k = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < 20; i++)
            a[i] = k * i;
    } /*omp end parallel*/

[Diagram: with two threads, one computes a[i] = k*i for i = 0..9, the other for i = 10..19]
Directives - for
The loop must have canonical shape:

    for ([integer or pointer type] var = b; var < e; var = var + incr)

- different comparisons are possible
- different increments are possible
- var can not be modified inside the loop
- b, e, incr must be invariant during the loop
- the number of iterations must be computable at loop begin
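For example, a loop with a non-unit, invariant increment still has canonical shape (illustrative sketch):

    #include <stdio.h>

    int main(void)
    {
        double a[1000];

        /* canonical: integer control variable i, invariant bound and
           increment, iteration count computable at loop begin */
        #pragma omp parallel for
        for (int i = 0; i < 1000; i += 4)
            a[i] = 0.0;

        printf("a[996] = %f\n", a[996]);
        return 0;
    }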
Directives - for
Special clauses for the for directive:
- collapse: collapse nested loops and their iterations into a larger iteration space
- nowait: no synchronization at the end of the parallel loop
- schedule(type[, chunk]), with type one of:
  - static: statically assign chunks in a round-robin fashion; the default chunk size amounts to one piece for each thread; good if all iterations take the same time; deterministic
  - dynamic: dynamically assign chunks to idling threads; default chunk size 1; more overhead, but better load balancing
  - guided: exponentially decrease the chunk size while dispatching; chunk specifies the smallest piece; default chunk size 1
  - auto: scheduling determined by the compiler and/or at run time
  - runtime: scheduling determined at run time, using the OMP_SCHEDULE variable
The default schedule is implementation specific (so better set it yourself!). A scheduling sketch follows below.
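A sketch of dynamic scheduling for iterations of uneven cost (the helper work() and the chunk size 4 are illustrative assumptions):

    #include <stdio.h>

    /* hypothetical helper: cost grows with i, so iterations are unbalanced */
    static double work(int i)
    {
        double s = 0.0;
        for (int k = 0; k < i * 1000; k++)
            s += k * 1e-9;
        return s;
    }

    int main(void)
    {
        double total = 0.0;

        /* dynamic scheduling hands out chunks of 4 iterations to idle
           threads: better load balance at the cost of some overhead */
        #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
        for (int i = 0; i < 256; i++)
            total += work(i);

        printf("total = %f\n", total);
        return 0;
    }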
Directives - single
- block is only executed by one thread
- implicit barrier at the end (unless nowait is specified)
- reduces fork-join overhead
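A minimal sketch (illustrative; compile with gcc -fopenmp):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            /* executed by exactly one thread; the others wait at the
               implicit barrier at the end of the single block */
            #pragma omp single
            printf("initialization done by one thread\n");

            /* all threads continue here */
            printf("thread %d past the single block\n",
                   omp_get_thread_num());
        }
        return 0;
    }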
Directives - task

    struct node {
        struct node *left;
        struct node *right;
    };

    void traverse(struct node *p)
    {
        if (p->left) {
            #pragma omp task
            traverse(p->left);
        }
        if (p->right) {
            #pragma omp task
            traverse(p->right);
        }
        process(p); // expensive stuff
    }

    int main(int argc, char **argv)
    {
        struct node tree;
        // fill tree
        #pragma omp parallel
        {
            #pragma omp single
            {
                traverse(&tree);
            } /*omp end single*/
        } /*omp end parallel*/
    }
Directives - task
Further properties:
- tasks are created when a task pragma is encountered
- pending tasks are started when a thread becomes available
- #pragma omp taskwait can be used to perform task synchronization, as in the sketch below
- many clauses are available
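A minimal taskwait sketch (the helpers part_a() and part_b() are hypothetical): the parent task blocks until its child tasks have finished before combining their results.

    #include <stdio.h>

    /* hypothetical work functions */
    static int part_a(void) { return 1; }
    static int part_b(void) { return 2; }

    int main(void)
    {
        int a = 0, b = 0;

        #pragma omp parallel
        {
            #pragma omp single
            {
                #pragma omp task shared(a)
                a = part_a();
                #pragma omp task shared(b)
                b = part_b();

                /* wait for both child tasks before using a and b */
                #pragma omp taskwait
                printf("a + b = %d\n", a + b);
            }
        }
        return 0;
    }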
Synchronization Constructs - critical
- the enclosed code is executed by all threads, but restricted to only one thread at a time
- a name can be supplied after the directive to differentiate different critical parts
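A minimal sketch using a named critical section (the name counter is illustrative):

    #include <stdio.h>

    int main(void)
    {
        int hits = 0;

        #pragma omp parallel
        {
            /* only one thread at a time may execute this block; the name
               "counter" distinguishes it from other, independently locked
               critical sections */
            #pragma omp critical (counter)
            hits++;
        }

        printf("hits = %d (one per thread)\n", hits);
        return 0;
    }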
Pros and Cons
Pros:
- portable multithreading code
- data layout and decomposition are handled automatically
- incremental approach
- code works in serial without adjustments
- the original code does not change much
Cons:
- risk of race conditions
- shared memory only