Introduction to OpenMP Ricardo Fonseca https://sites.google.com/view/rafonseca2017/
Outline
- Shared Memory Programming: OpenMP, Fork-Join Model, Compiler Directives / Run-time Library Routines, Compiling and running OpenMP programs
- OpenMP fundamentals: Approaches to Parallelism, Data dependencies, Shared / private variables
- OpenMP directives / functions
- Overview
- Examples & Projects
Shared Memory Programming
Shared Memory Programming
Shared memory systems, such as multi-core workstations, have a single address space:
- Applications can be developed in which loop iterations (with no dependencies) are executed by different processors
- Shared memory codes are mostly data-parallel, SIMD kinds of codes
- OpenMP is the de facto standard for shared memory programming (compiler directives)
- Most compilers support it natively (gcc, icc, xlc, gfortran, ifort, xlf, etc.)
OpenMP programming model
OpenMP (Open Multi-Processing) is an API used to explicitly direct multithreaded, shared memory parallelism.
- Programming model: parallelism is achieved through the use of threads; all threads share the address space
- Explicit parallelism: offers full control over parallelization; can be as simple as taking a serial program and inserting compiler directives
- Simple to use: most parallelism is specified through simple compiler directives and a small API
- Works on shared memory systems only
Fork-Join Model
- All OpenMP programs begin as a single process (the master thread)
- Fork: the master thread creates a team of parallel threads
- The statements in the parallel region are executed in parallel by the team threads
- Join: when the team threads complete the parallel region they synchronize and terminate
- The number of parallel regions, and the number of threads in each, is arbitrary
A minimal sketch of the pattern follows below.
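A minimal sketch of the fork-join pattern in C, using only the standard OpenMP API covered in these slides:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        printf("serial region: master thread only\n");

        #pragma omp parallel          /* fork: spawn a team of threads */
        {
            printf("parallel region: thread %d\n", omp_get_thread_num());
        }                             /* join: implicit barrier, team ends */

        printf("serial region again: master thread only\n");
        return 0;
    }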
Compiler Directives
Appear as comments in the source code and are ignored unless the compiler is told otherwise. They are used for various purposes:
- Spawning a parallel region
- Dividing blocks of code among threads
- Distributing loop iterations between threads
- Serializing sections of code
- Synchronization of work among threads

Example:

    /* Fork a team of threads giving them their own copies of variables */
    #pragma omp parallel private(nthreads, tid)
    {
        ...
    }
    /* All threads join master thread and disband */
Run-time Library Routines
The API includes a set of routines for:
- Setting and querying the number of threads
- Querying a thread's unique identifier (thread ID), a thread's ancestor's identifier, and the thread team size
- Setting, initializing and terminating locks and nested locks
- Querying wall clock time and resolution
- etc.

Example:

    /* Obtain thread number */
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);

    /* Obtain total number of threads */
    nthreads = omp_get_num_threads();
    printf("Number of threads = %d\n", nthreads);
OpenMP Example (http://www.openmp.org/)

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main (int argc, char *argv[])
    {
        int nthreads, tid;

        /* Fork a team of threads giving them their own copies of variables;
           the next block is split over all available threads */
        #pragma omp parallel private(nthreads, tid)
        {
            /* All threads do this */
            tid = omp_get_thread_num();
            printf("Hello World from thread = %d\n", tid);

            /* Only master thread does this */
            if (tid == 0) {
                nthreads = omp_get_num_threads();
                printf("Number of threads = %d\n", nthreads);
            }
        } /* All threads join master thread and disband */

        return 0;
    }
Compiling/Running OpenMP programs
OpenMP programs can be compiled using a compiler supporting OpenMP:

    $ gcc -fopenmp HelloWorld.c -o HelloWorld

Launch programs as usual; the number of threads can be controlled through the OMP_NUM_THREADS environment variable:

    bash$ export OMP_NUM_THREADS=3
    bash$ ./omp_hello_world
    Hello World from thread = 1
    Hello World from thread = 0
    Hello World from thread = 2
    Number of threads = 3
    bash$
Environment Variables
OpenMP provides several environment variables for controlling the execution of parallel code at run-time:
- Setting the number of threads (OMP_NUM_THREADS, as in the example above)
- Specifying how loop iterations are divided
- Binding threads to processors
- Enabling/disabling dynamic threads
- etc.
OpenMP fundamentals
Approaches to Parallelism
There are two main approaches for distributing work among threads (a sketch of both follows below):
- Parallel loops: individual loops are parallelized by assigning each thread a range of loop indexes
- Parallel regions: the code launches a number of threads, each with a unique id; it is up to the programmer to split the workload
Code outside these sections is executed serially.
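A minimal sketch contrasting the two approaches (the array name, size, and the round-robin split are illustrative choices):

    #include <omp.h>

    #define N 1000
    double a[N];

    void parallel_loop(void)
    {
        /* Parallel loop: OpenMP assigns iteration ranges to threads */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;
    }

    void parallel_region(void)
    {
        /* Parallel region: the programmer splits the work by thread id */
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            int nt  = omp_get_num_threads();
            for (int i = tid; i < N; i += nt)   /* round-robin split */
                a[i] = 2.0 * i;
        }
    }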
Data dependencies
Not all operations in the code can be performed in parallel; some operations require other operations to complete first:

    for (i=1; i < N; i++) {
        a[i] = a[i] + a[i-1];
    }

When an operation depends on the result of another one, this is called a data dependency.
A loop can be straightforwardly parallelized if there are no data dependencies:
- All assignments are performed on arrays
- Each element of an array is written by at most one iteration
- No loop iteration reads array elements modified by other iterations
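For contrast, a minimal sketch of a loop with no data dependencies (array names are illustrative), which can be split among threads directly:

    /* Each iteration writes only a[i] and reads only b[i] and c[i]:
       no iteration depends on another, so the loop parallelizes freely */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];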
Shared / private variables
Since we are in a shared memory environment, variables inside a parallel region share the same address. For the loop index in a parallel loop this would of course pose a problem: different threads require different values of this variable.
OpenMP offers control over how variables are shared among threads, or kept private inside each thread, using clauses:
- private: each thread has a different copy of the variable. This is the default for the loop index variable.
- shared: all threads share the variable. This is the default for all other variables.

Example:

    #pragma omp for private(tmp) shared(a,b,c)
    for (i=0; i < N; i++) {
        tmp = 2 * a[i];
        a[i] = tmp;
        b[i] = c[i] / tmp;
    }
OpenMP Yee field solver
Algorithm:
- Spawn nt threads
- Each thread handles a given field grid region inside the node
- No algorithm overhead
- Only 2 lines of OpenMP code!

    !$omp parallel do private(i2,i1)
    do i3 = 0, b%nx(3) + 1
      do i2 = 0, b%nx(2) + 1
        do i1 = 0, b%nx(1) + 1
          b%f3( 1, i1, i2, i3 ) = ...
          b%f3( 2, i1, i2, i3 ) = ...
          b%f3( 3, i1, i2, i3 ) = ...
        enddo
      enddo
    enddo
    !$omp end parallel do

[Figure: the local E,B field grid of a shared memory node, split across threads 1-3]
OpenMP directives / functions
Parallel for construct
Specifies that the iterations of the loop immediately following it must be executed in parallel by the team.

C/C++:

    #pragma omp for [clause ...] newline
        schedule (type [,chunk])
        ordered
        private (list)
        firstprivate (list)
        lastprivate (list)
        shared (list)
        reduction (operator: list)
        collapse (n)
        nowait
    for (...) { ... }

The amount of work (chunk) for each thread can be controlled. Nested loops are also allowed (see the collapse sketch below).

Example:

    #pragma omp for
    for (i=0; i < N; i++)
        c[i] = a[i] + b[i];
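A minimal sketch of the collapse clause on a nested loop (the matrix name and sizes are illustrative): collapse(2) merges the two loop nests into a single iteration space that OpenMP then divides among threads.

    #define NR 100
    #define NC 200
    double m[NR][NC];

    /* collapse(2) exposes all NR*NC iterations to the scheduler,
       instead of only the NR iterations of the outer loop */
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < NR; i++)
        for (int j = 0; j < NC; j++)
            m[i][j] = i + 0.5 * j;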
Variables in parallel regions
OpenMP includes a number of clauses to control how variables are shared (or not) among threads:
- private(var list): creates a separate memory space for the variables. The variables are not initialized; the programmer must initialize them inside the parallel region.
- shared(var list): all the threads will be able to modify and access the variable. This is the default (except for the loop index).
- default(shared | none): changes the default behavior. With default(none) the programmer must declare the scope of every variable explicitly, which avoids accidental sharing. (Fortran additionally allows default(private).)

Example:

    #pragma omp for private(tmp) shared(a,b,c)
    for (i=0; i < N; i++) {
        tmp = 2 * a[i];
        a[i] = tmp;
        b[i] = c[i] / tmp;
    }
Initializing / retaining private variables
OpenMP also includes clauses to control how private variables are initialized, and how to retain their values after the parallel section:
- firstprivate(var list): declares a variable private, and initializes each thread's copy with the value the variable had before the beginning of the parallel section.
- lastprivate(var list): declares a variable private, and copies the value from the sequentially last loop iteration back to the original variable when the parallel section ends.

Examples:

    j = 1;
    #pragma omp parallel for firstprivate(j)
    for(i=0; i<size; i++) {
        a[i] = a[i] + j;
    }

    #pragma omp parallel for lastprivate(x)
    for(i=0; i<size; i++) {
        x = (double)i / (size-1);
        a[i] = x*x;
    }
    printf("last x = %g\n", x);
Parallel Reduction
A parallel reduction is a very common operation, so OpenMP includes a clause to avoid doing it explicitly.

C/C++:

    reduction (operator : list)

This can be used, for example, to calculate the dot product of two large arrays:

    result = 0.0;
    #pragma omp for reduction(+:result)
    for (i=0; i < N; i++)
        result = result + (a[i] * b[i]);
    printf("final result = %f\n", result);
Order of execution
Threads inside a parallel region will be executed in an arbitrary order. If required, the programmer can force a region of code to be executed in sequential order, just like the serial version. This is done using the ordered clause and directive:

    /* The ordered clause flags that there is an ordered section in this loop */
    #pragma omp parallel for private(t) ordered
    for(i=0; i<size; i++) {
        t = func(i);

        /* This block is executed in order of increasing loop index */
        #pragma omp ordered
        {
            printf("func(%d) = %g\n", i, t);
        }
    }
Controlling the work done by each thread
When using a parallel for section, OpenMP defaults to splitting the loop into equal chunks among the available threads. This may not always be the most efficient way to partition work:
- In some cases different iterations in a loop may have different workloads
- This leads to load imbalance and lower performance: the slowest thread dominates computing time
OpenMP allows the programmer to control the distribution of iterations over the available threads using the schedule clause.
Schedule clause

C/C++:

    schedule( type [, chunk_size] )

Main schedule types:
- static: assigns the same number of iterations to each thread
- dynamic: assigns 1 iteration to each thread; when a thread finishes, it is assigned 1 more iteration, until all iterations complete

Chunk size (optional):
- Controls the number of iterations assigned to each thread each time
- Defaults to Niter/Nthreads for static and 1 for dynamic

Example:

    #pragma omp for schedule(dynamic)
    for (i=0; i < N; i++) {
        a[i] = variable_work_load(i);
    }
Parallel region construct
A parallel region is a block of code that will be executed by multiple threads.

C/C++:

    #pragma omp parallel [clause ...] newline
        if (scalar_expression)
        private (list)
        shared (list)
        default (shared | none)
        firstprivate (list)
        reduction (operator: list)
        copyin (list)
        num_threads (integer-expression)
    { ... }

Can set the number of threads and control which variables are shared / private among threads. When reaching a parallel directive the code creates a team of threads and the block is executed by all of them. There is an implied barrier at the end of the parallel section. A minimal sketch follows below.
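A minimal sketch of a parallel region using the num_threads and private clauses (the thread count of 4 is an arbitrary choice):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        int tid;

        /* Request a team of 4 threads; tid is private to each thread */
        #pragma omp parallel num_threads(4) private(tid)
        {
            tid = omp_get_thread_num();
            printf("thread %d of %d\n", tid, omp_get_num_threads());
        } /* implied barrier: all threads synchronize here before joining */

        return 0;
    }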
Critical / Barrier

C/C++:

    #pragma omp critical
    { ... }

    #pragma omp barrier

Used inside parallel regions:
- The CRITICAL directive specifies a region of code that must be executed by only one thread at a time. This allows the threads to avoid conflicts, e.g. when writing to memory.
- The BARRIER directive synchronizes all threads in the team: no thread will continue until all threads have reached the barrier.
A sketch combining both directives follows below.
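A minimal sketch combining both directives (the shared counter and per-thread values are illustrative):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        int total = 0;   /* shared among all threads */

        #pragma omp parallel
        {
            int mine = omp_get_thread_num() + 1;   /* per-thread value */

            /* critical: only one thread updates the shared sum at a time */
            #pragma omp critical
            {
                total += mine;
            }

            /* barrier: wait until every thread has added its value */
            #pragma omp barrier

            if (omp_get_thread_num() == 0)
                printf("total = %d\n", total);
        }
        return 0;
    }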
OpenMP library functions
Include <omp.h> to call these routines (not required for simple OpenMP programs that only use directives).
Querying thread configuration:
- omp_get_num_threads() - gets the number of active threads inside a parallel region
- omp_get_thread_num() - gets the unique thread id inside a parallel region
- omp_get_max_threads() - gets the default maximum number of threads in a program
Functions for timing your code:
- omp_get_wtime() - gets the elapsed time from a fixed time in the past, in seconds
- omp_get_wtick() - gets the timer precision, in seconds
A timing sketch follows below.
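A minimal sketch of timing a parallel loop with omp_get_wtime() (the loop body and array size are illustrative; compile with -fopenmp, and -lm for the math library):

    #include <omp.h>
    #include <stdio.h>
    #include <math.h>

    #define N 1000000

    int main(void)
    {
        static double a[N];
        double start = omp_get_wtime();        /* wall clock, in seconds */

        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = sin(i) * sin(i);

        double elapsed = omp_get_wtime() - start;
        printf("elapsed: %g s (timer tick: %g s)\n", elapsed, omp_get_wtick());
        return 0;
    }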
Overview
Overview
- OpenMP allows programmers to easily exploit multiple cores on shared memory systems
- It can be applied from modest laptops to high-end workstations
- OpenMP provides a standard/portable toolset for this computing paradigm
- Minimal learning curve / hardware resources required to begin parallel programming
Further Reading
- NCSA Introduction to OpenMP course: https://www.citutor.org/login.php?course=24
- Lawrence Livermore National Laboratory OpenMP tutorial: https://computing.llnl.gov/tutorials/openmp/
- Parallel Programming in OpenMP, R. Chandra et al., Morgan Kaufmann
- Using OpenMP, B. Chapman et al., MIT Press
Examples & Projects
Example Programs I/II
Source for the examples can be found at https://sites.google.com/view/rafonseca2017/
OpenMP Fundamentals:
- Hello World (helloworld.c)
- Parallel for construct (parallel_for.c)
- Private/shared variables (private_shared.c)
- Reduction (reduction.c)
Clauses:
- Initializing private variables (firstprivate.c)
- Retaining value of private variables (lastprivate.c)
- Ordered execution (ordered.c)
Example Programs II/II
Parallel for scheduling:
- Scheduling modes (schedule.c)
- Run the code with:
  - The OpenMP pragmas commented out (serial execution)
  - Static scheduling
  - Dynamic scheduling
- Analyze the results
Project 1
Write a program that calculates a matrix multiplication using OpenMP:
- Implement C[n x p] = A[n x m] B[m x p] (#rows x #cols, indexing $A_{row,col}$):

  $C_{ij} = \sum_{k=1}^{m} A_{ik} B_{kj}$

- Verify correctness (compare parallel vs. serial execution)
- Measure speedup for different matrix sizes
A starting-point sketch follows below.
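A minimal starting-point sketch for the parallel triple loop (allocation, initialization, and verification are left to the project; this is not a full solution):

    /* C[n][p] = A[n][m] * B[m][p]; arrays assumed allocated, a and b
       initialized. Each thread computes a range of rows of C. */
    void matmul(int n, int m, int p,
                double a[n][m], double b[m][p], double c[n][p])
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            for (int j = 0; j < p; j++) {
                double sum = 0.0;            /* private accumulator */
                for (int k = 0; k < m; k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
    }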
Project 2
Calculate $\pi$ in parallel using:

  $\pi = \int_0^1 \frac{4}{1+x^2}\,dx$

- Integrate using Euler's method
- Split the integration interval over the available threads
- Calculate the final result using an OpenMP reduction
A starting-point sketch follows below.
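A minimal starting-point sketch summing the integrand at sub-interval midpoints with a reduction (the number of steps is an arbitrary choice):

    #include <omp.h>
    #include <stdio.h>

    #define NSTEPS 100000000

    int main(void)
    {
        double h = 1.0 / NSTEPS;
        double sum = 0.0;

        /* Each thread sums part of the interval; reduction combines them */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < NSTEPS; i++) {
            double x = (i + 0.5) * h;        /* midpoint of sub-interval */
            sum += 4.0 / (1.0 + x * x);
        }

        printf("pi ~= %.12f\n", sum * h);
        return 0;
    }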