Practical in Numerical Astronomy, SS 2012, LECTURE 12
Parallelization II: Open Multiprocessing (OpenMP)
Lecturer: Eduard Vorobyov. Email: eduard.vorobiev@univie.ac.at, room 006.6

OpenMP is a shared-memory parallelism standard. It is designed for SMP (symmetric multiprocessing) machines. Wikipedia: symmetric multiprocessing involves a multiprocessor computer hardware architecture where two or more identical processors are connected to a single shared main memory and are controlled by a single OS instance.

MPI is a distributed-memory parallelism standard. It is designed for computer clusters with distributed memory. Wikipedia: distributed memory refers to multiple-processor computer systems in which each processor has its own private memory.

Basic idea: the fork-join programming model

[Diagram: master thread 0 runs alone in the serial regions; threads 0-4 run side by side in the parallel region.]

1) The code starts as serial (non-parallel) code and has only one master thread.
2) The master thread is forked into N threads when a parallel region is encountered (in this example four additional threads 1, 2, 3, 4 are created). Thread 0 remains the master of all five threads.
3) Each thread executes its part of the code in parallel with the other threads.
4) Upon completion of the parallel region, the threads are joined into one master thread, which continues execution in the serial region.
5) Calculations continue in serial mode until a new parallel region is reached.

A minimal code sketch of this model is given below.
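A minimal sketch (it assumes the compiler provides the standard omp_lib module; the program name is illustrative). It prints the thread number once in each serial region and once per thread in the parallel region:

   program fork_join_demo
      use omp_lib
      implicit none
      print *, 'serial region,   thread ', omp_get_thread_num()   ! only the master thread (0) runs here
   !$omp parallel
      print *, 'parallel region, thread ', omp_get_thread_num()   ! printed once by every thread
   !$omp end parallel
      print *, 'serial region,   thread ', omp_get_thread_num()   ! back to the master thread only
   end program fork_join_demo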

Parallelizing a serial code using OpenMP directives

The OpenMP standard offers the possibility of using the same source code with and without OpenMP parallelization (the MPI standard does not offer this!). This is achieved by hiding the OpenMP directives and commands in such a way that a normal compiler is unable to see them. For that purpose the following directive sentinel is introduced:

!$OMP

Since the first character is an exclamation mark (!), a normal compiler will interpret the line as a comment and will ignore its content. An OpenMP-compliant compiler, however, will identify the complete sequence and will execute the commands that follow, for example:

!$OMP PARALLEL DEFAULT(shared) PRIVATE(C, D) REDUCTION(+:a)
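OpenMP also defines the conditional-compilation sentinel !$, which hides whole Fortran statements from a normal compiler in the same way. A small sketch of how one source behaves in both serial and OpenMP builds (program and variable names are illustrative):

   program sentinel_demo
   !$ use omp_lib                          ! compiled only when OpenMP is enabled
      implicit none
      integer :: nthreads
      nthreads = 1                         ! value seen by a plain serial build
   !$ nthreads = omp_get_max_threads()     ! value seen by an OpenMP build
      print *, 'this build can use up to ', nthreads, ' thread(s)'
   end program sentinel_demo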

Making the FORTRAN compiler recognize OpenMP directives

For the FORTRAN compiler to recognize OpenMP directives, one needs to compile the source code with a specific flag, which is compiler-dependent and tells the compiler to link the OpenMP libraries:

GNU Fortran compiler:    gfortran -fopenmp
Intel Fortran compiler:  ifort -openmp
PGI Fortran compiler:    pgf90 -mp

Note that when using OpenMP all local arrays will be allocated on the stack. When porting existing code to OpenMP, this may lead to surprising results, in particular to segmentation faults if the stack size is limited.

Setting the number of threads in a parallel region

The number of threads can be set by environment variables.

In the BASH shell:
   export OMP_NUM_THREADS=8

In the TCSH shell:
   setenv OMP_NUM_THREADS 8

Environment variables affect all OpenMP codes that are run from the given terminal.

The number of threads can also be set by OpenMP library calls:

   subroutine OMPsetup
      integer omp_get_num_threads, omp_get_max_threads, omp_get_num_procs
      call OMP_SET_NUM_THREADS(8)                       ! sets the number of threads to 8
   !$omp parallel                                       ! parallel region starts here
   !$omp master                                         ! executed only by the master thread
      print *, 'num threads=', omp_get_num_threads()    ! number of executing threads
      print *, 'max threads=', omp_get_max_threads()    ! maximum possible number of threads
      print *, 'max cpus=', omp_get_num_procs()         ! available number of processors
   !$omp end master
   !$omp end parallel                                   ! end of the parallel region
   end subroutine OMPsetup

Note that OMP_SET_NUM_THREADS is called from a serial part of the code. The library call OMP_SET_NUM_THREADS supersedes the environment variable OMP_NUM_THREADS.

The PARALLEL construct

The most important directive in OpenMP is the one in charge of defining the so-called parallel regions. Such a region is a block of code that is going to be executed by multiple threads running in parallel. Since a parallel region needs to be created/opened and destroyed/closed, two directives are necessary, forming a so-called directive pair: !$OMP parallel ... !$OMP end parallel

   ... serial code ...
   !$omp parallel
      write(*,*) "Hello"        ! parallel code
   !$omp end parallel
   ... serial code ...

Since the code enclosed between the two directives is executed by each thread, the message Hello appears on the screen as many times as there are threads in the parallel region. Before and after the parallel region the code is executed by only one thread, which is the normal behavior of serial programs.

Parallelizing a DO loop. The PRIVATE clause

Serial DO loop:

   integer k
   do k = 1, 1000
      ...
   end do

The master thread does all the work (thread 0 executes k = 1, 1000).

Parallel DO loop:

   integer k
   call OMP_SET_NUM_THREADS(2)
   !$omp parallel do private(k)
   do k = 1, 1000
      ...
   end do
   !$omp end parallel do

Each thread computes part of the global DO loop (thread 0 executes k = 1, 500 and thread 1 executes k = 501, 1000).

Note that the same counter variable k has different values in each thread of the parallelized DO loop! To avoid memory conflicts, two copies of the variable k need to be created in memory. The clause PRIVATE(k) tells the compiler that each thread needs to have its own copy of the variable k. The PRIVATE clause can be very resource consuming, so variables should be declared private only if they are modified inside the DO loop. Upon entering and after leaving the parallel DO loop, the variable k is undefined (in the serial DO loop, by contrast, k = 1001 after leaving the loop).
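Put together, a complete runnable version of such a loop might look as follows (the array x and the loop body are illustrative additions):

   program do_private_demo
      use omp_lib
      implicit none
      integer, parameter :: n = 1000
      integer :: k
      real, dimension(n) :: x
      call omp_set_num_threads(2)
   !$omp parallel do private(k) shared(x)
      do k = 1, n
         x(k) = sqrt(real(k))          ! each thread fills its own part of x
      end do
   !$omp end parallel do
      print *, 'x(1) =', x(1), '  x(n) =', x(n)
   end program do_private_demo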

Shared variables. The SHARED clause

In contrast to the previous situation, sometimes there are variables which should be available to all threads inside the DO loop, either because their values are needed by all threads or because all threads have to update their values.

   program example
      implicit none
      integer, parameter :: n = 10
      integer i
      real b
      real, dimension(n) :: a
      call omp_set_num_threads(4)                  ! setting the number of threads to 4
   !$omp parallel do shared(a,n) private(i,b)      ! parallel DO loop begins here
      do i = 1, n
         b = i + 1
         a(i) = b
      end do
   !$omp end parallel do                           ! parallel DO loop ends here
   end

In this example, the array a, the variable b, and the counter i are all modified inside the DO loop. However, each iteration of the loop accesses a different element of the array a, so there is no need to create separate copies of a. Such variables are declared as SHARED.

Use SHARED when:
- a variable is not modified in the loop (as, e.g., n), or
- a variable is an array in which each iteration of the loop accesses a different element.

Other DO loop clauses

FIRSTPRIVATE(list)
LASTPRIVATE(list)
REDUCTION(operator:list)
SCHEDULE(type, chunk)
ORDERED
DEFAULT

The FIRSTPRIVATE clause

Private variables have an undefined value on entering the parallel do construct. Sometimes, however, it is of interest that these local copies start with the value of the original variable in the serial part of the code. This is achieved by including the variable in a FIRSTPRIVATE clause as follows:

   integer a, b
   a = 2
   b = 1
   !$omp parallel do private(a) firstprivate(b)
   ...
   !$omp end parallel do

In this example, the variable a has an undefined value at the beginning of the parallel region, while b has the value specified in the preceding serial region, namely b = 1.
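A complete sketch of this behaviour (the loop bounds and the loop body are illustrative additions):

   program firstprivate_demo
      implicit none
      integer :: i, a, b
      a = 2
      b = 1
   !$omp parallel do private(a) firstprivate(b)
      do i = 1, 4
         a = b + i          ! b enters every thread with the value 1; a enters undefined
      end do
   !$omp end parallel do
   end program firstprivate_demo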

The LASTPRIVATE clause

Private variables have an undefined value after leaving the parallel do construct. This is sometimes inconvenient. By including a variable in a LASTPRIVATE clause, the original variable is updated with the last value the private copy would get if the DO loop were executed in serial mode. For example:

   integer i, a
   !$omp parallel do private(i) lastprivate(a)
   do i = 1, 1000
      a = i
   end do
   !$omp end parallel do

After the parallel DO loop has finished, the variable a is equal to 1000, which is the value it would have if the OpenMP directives did not exist.

The REDUCTION clause

Serial code:

   integer i, a
   do i = 1, 1000
      a = a + i
   end do

Wrong OpenMP parallelization!

   !$omp parallel do private(i) shared(a)
   do i = 1, 1000
      a = a + i
   end do
   !$omp end parallel do

When a variable has been declared as SHARED because all threads need to modify its value, it is necessary to ensure that only one thread at a time writes/updates the memory location of that variable, otherwise unpredictable results will occur. The REDUCTION clause solves this problem: each thread accumulates into its own private copy of a, and the partial results are combined safely into the shared variable at the end, ensuring that the final result is the correct one.

   !$omp parallel do reduction(+:a)
   do i = 1, 1000
      a = a + i
   end do
   !$omp end parallel do

General syntax of the REDUCTION clause

REDUCTION(operator or intrinsic function : variable list)

Initialization rules for the variables in the variable list: a private copy of each variable in the variable list is created for each thread, as if the PRIVATE clause had been used. The resulting private copies are initialized following the rules shown in the table below. At the end of the REDUCTION, the shared variable is updated to reflect the result of combining the final value of each private copy using the specified operator.
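The initialization values defined by the OpenMP Fortran specification are:

Operator / intrinsic    Initial value of the private copy
+                       0
-                       0
*                       1
.AND.                   .TRUE.
.OR.                    .FALSE.
.EQV.                   .TRUE.
.NEQV.                  .FALSE.
MAX                     smallest representable number
MIN                     largest representable number
IAND                    all bits set
IOR                     0
IEOR                    0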

The SCHEDULE clause. Load balancing.

   call omp_set_num_threads(4)
   !$omp parallel do private(k) shared(n)
   do k = 1, n
      ...
   end do
   !$omp end parallel do

When a do-loop is parallelized and its iterations are distributed over the different threads, the simplest way of doing this is to give each thread the same number of iterations: n/4. This is not always the best choice, however, since the computational cost of the iterations may not be equal for all of them. Therefore, different ways of distributing the iterations exist. The SCHEDULE clause allows the programmer to specify the scheduling for each do-loop using the following syntax:

   call omp_set_num_threads(4)
   !$omp parallel do private(k) shared(n) schedule(type, chunk)
   do k = 1, n
      ...
   end do
   !$omp end parallel do

The SCHEDULE clause accepts two parameters. The first one, type, specifies the way in which the work is distributed over the threads. The second one, chunk, is an optional parameter specifying the size of the pieces of work given to each thread.

STATIC: when this option is specified, the pieces of work created from the iteration space of the do-loop are distributed over the threads in the team following the order of their thread identification number. This assignment of work is done at the beginning of the do-loop and stays fixed during its execution.

[Diagram: number of threads = 3 and DO-loop iteration space k = 1, 600; no value of chunk is specified, so each thread receives one contiguous block of 200 iterations.]

STATIC is the best choice in most cases.

When SCHEDULE(DYNAMIC, chunk) is specified, the iteration space is divided into pieces of work of size chunk. If this optional parameter is not given, a size of one iteration is used. Each thread then gets one of these pieces of work; when a thread has finished its piece, it is assigned a new one, until no pieces of work are left.

[Diagram: example of dynamic scheduling.]

See also the GUIDED and RUNTIME schedule types.
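A sketch of a situation where dynamic scheduling pays off (the loop body is an illustrative, deliberately unbalanced workload):

   program dynamic_schedule_demo
      use omp_lib
      implicit none
      integer, parameter :: n = 600
      integer :: k, j
      real, dimension(n) :: cost
      call omp_set_num_threads(4)
   !$omp parallel do private(k,j) shared(cost) schedule(dynamic, 10)
      do k = 1, n
         cost(k) = 0.0
         do j = 1, k                       ! the work grows with k, so iterations are unequal in cost
            cost(k) = cost(k) + sin(real(j))
         end do
      end do
   !$omp end parallel do
      print *, 'cost(n) = ', cost(n)
   end program dynamic_schedule_demo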

The ORDERED clause. Eliminating the race condition.

   program race_condition
      integer i
      integer, dimension(5) :: a, b
      a = 1
      b = 2
      call omp_set_num_threads(2)
   !$omp parallel do private(i) shared(a,b)
      do i = 1, 4
         a(i+1) = a(i) + b(i)
      end do
   !$omp end parallel do
   end

Thread 0: a(2) = a(1) + b(1),  a(3) = a(2) + b(2)
Thread 1: a(4) = a(3) + b(3),  a(5) = a(4) + b(4)

PROBLEM: there is a data dependency between the iterations, causing a so-called race condition: thread 1 may read a(3) before thread 0 has computed it.

A solution is to use the ORDERED clause, which tells the compiler that some statements in the DO-loop need to be executed sequentially.

   program no_race_condition
      integer i
      integer, dimension(5) :: a, b
      a = 1
      b = 2
      call omp_set_num_threads(2)
   !$omp parallel do private(i) shared(a,b) ordered
      do i = 1, 4
   !$omp ordered
         a(i+1) = a(i) + b(i)
   !$omp end ordered
      end do
   !$omp end parallel do
   end

In this case, however, the threads do not actually run in parallel.

The DEFAULT( PRIVATE | SHARED | NONE ) clause

When most of the variables used inside the DO-loop are going to be private/shared, it would be cumbersome to include all of them in one of the previous clauses. To avoid this, it is possible to specify what OpenMP has to do when nothing is said about a specific variable: it is possible to specify a default setting. For example:

   !$omp parallel do default(private) shared(a)

Parallelization of implicit DO-loops. The WORKSHARE construct.

FORTRAN 90 array operations include implicit DO-loops and can be parallelized by the WORKSHARE construct.

Serial code:

   real, dimension(10) :: a, b, c
   ...
   a = 5.0 * cos(a) + 4.0 * sin(a)
   ...

Parallelized code:

   real, dimension(10) :: a, b, c
   ...
   !$omp parallel workshare
   a = 5.0 * cos(a) + 4.0 * sin(a)
   !$omp end parallel workshare
   ...

Not all compilers support parallelization of FORTRAN 90 array operations!

Parallelization of nested DO-loops

When several nested do-loops are present, it is always convenient to parallelize the outermost one, since then the amount of work distributed over the different threads is maximal. Also, the number of times the !$OMP parallel do ... !$OMP end parallel do directive pair effectively acts is minimal, which implies minimal overhead from the OpenMP directives.

Innermost loop parallelized:

   do i = 1, 10
      do j = 1, 10
   !$omp parallel do private(k) shared(A,j,i)
         do k = 1, 10
            A(k,j,i) = i * j * k
         end do
   !$omp end parallel do
      end do
   end do

Outermost loop parallelized:

   !$omp parallel do private(i,j,k) shared(A)
   do i = 1, 10
      do j = 1, 10
         do k = 1, 10
            A(k,j,i) = i * j * k
         end do
      end do
   end do
   !$omp end parallel do

In the first case, the work to be computed in parallel is distributed i*j = 100 times, and each thread gets fewer than 10 iterations to compute, since only the innermost do-loop is parallelized. In the second case, the work is distributed only once, and the work given to each thread contains at least j*k = 100 iterations. Therefore, better performance of the parallelization is to be expected in the second case.

The SECTIONS construct

The SECTIONS construct allows one to assign a completely different task to each thread, leading to an MPMD model of execution (MPMD stands for Multiple Programs Multiple Data and refers to completely different programs/tasks which share or interchange information and which run simultaneously on different processors). Each section of code is executed once and only once by a thread in the team. The syntax of this construct is the following:

   !$omp parallel sections clause1 clause2 ...
   !$omp section
      ... code executed by one thread
   !$omp section
      ... code executed by another thread
   !$omp end parallel sections

Each block of code, to be executed by one of the threads, starts with an !$OMP section directive and extends until the same directive is found again or until the closing directive is found. Any number of sections can be defined inside this directive pair, but only the existing number of threads is used to distribute the different blocks of code. This means that if the number of sections is larger than the number of available threads, some threads will execute more than one section of code in a serial fashion.

Allowed clauses: PRIVATE, FIRSTPRIVATE, LASTPRIVATE, REDUCTION
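A small self-contained sketch with two independent tasks (the tasks themselves are illustrative):

   program sections_demo
      implicit none
      integer :: i
      real :: total_sin, total_cos
      call omp_set_num_threads(2)
   !$omp parallel sections private(i)
   !$omp section
      total_sin = 0.0
      do i = 1, 1000
         total_sin = total_sin + sin(real(i))   ! task executed by one thread
      end do
   !$omp section
      total_cos = 0.0
      do i = 1, 1000
         total_cos = total_cos + cos(real(i))   ! independent task executed by another thread
      end do
   !$omp end parallel sections
      print *, total_sin, total_cos
   end program sections_demo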

Calling serial subroutines inside a parallel region. The SINGLE construct.

   integer, dimension(0:3) :: a = 99
   integer :: i_am
   call omp_set_num_threads(4)
   !$omp parallel private(i_am) shared(a)
   i_am = omp_get_thread_num()
   call work(a, i_am)
   !$omp single
   print *, 'a = ', a
   !$omp end single
   !$omp end parallel

   subroutine work(a, i_am)
      integer, dimension(0:3) :: a      ! becomes shared
      integer :: i_am                   ! becomes private
      print *, 'work', i_am
      a(i_am) = i_am
   end subroutine work

Dummy arguments inherit the data-sharing attributes of the associated actual arguments.

The code enclosed in the SINGLE construct is executed by only one of the threads in the team, namely the first one to arrive at the opening directive !$OMP single. All the remaining threads wait at the implied synchronization at the closing directive !$OMP end single.

Result of execution:
   work 1
   work 3
   a = 99, 1, 99, 3
   work 2
   work 0

What went wrong? The SINGLE construct was executed by one of the threads (1 or 3) before threads 2 and 0 had completed execution of subroutine work.

   integer, dimension(0:3) :: a = 99
   integer :: i_am
   call omp_set_num_threads(4)
   !$omp parallel private(i_am) shared(a)
   i_am = omp_get_thread_num()
   call work(a, i_am)
   !$omp barrier                 ! all threads wait at the barrier
   !$omp single
   print *, 'a = ', a
   !$omp end single
   !$omp end parallel

   subroutine work(a, i_am)
      integer, dimension(0:3) :: a      ! becomes shared
      integer :: i_am                   ! becomes private
      print *, 'work', i_am
      a(i_am) = i_am
   end subroutine work

Result of execution:
   work 1
   work 3
   work 2
   work 0
   a = 0, 1, 2, 3

The BARRIER directive represents an explicit synchronization between the different threads in the team. When it is encountered, each thread waits until all the other threads have reached this point.

Calling parallel subroutines inside a parallel region

   call omp_set_num_threads(2)
   !$omp parallel shared(s) private(p)
   !$omp do private(j)
   do j = 1, 10
      ...
   end do
   !$omp end do
   call sub(s, p)
   !$omp end parallel
   ...
   end

   subroutine sub(s, p)
      integer :: s                 ! shared
      integer :: p                 ! private
      integer :: var, k            ! local variables are private
   !$omp do private(k)
      do k = 1, 10
         ...                       ! thread 0 will do the first 5 iterations
         ...                       ! thread 1 will do the last 5 iterations
      end do
   !$omp end do
      do k = 1, 10
         ...                       ! every thread executes all 10 iterations
      end do
   !$omp parallel do private(k)
      do k = 1, 10
         ...                       ! a PARALLEL directive inside another PARALLEL directive
      end do
   !$omp end parallel do
   end

A PARALLEL directive encountered dynamically inside another PARALLEL directive logically establishes a new team, which is composed of only the current thread, unless nested parallelism is enabled. We say that the loop is serialized: each of the two threads executes all 10 iterations of the last loop itself.

The MASTER and CRITICAL constructs

The code enclosed inside the MASTER construct is executed only by the master thread of the team. Meanwhile, all the other threads continue with their work. The syntax is as follows:

   !$omp master
   ...
   !$omp end master

In essence, this construct is similar to the !$OMP single ... !$OMP end single construct presented before, except that the thread executing the block of code is forced to be the master one instead of the first to arrive.

The CRITICAL construct restricts access to the enclosed code to only one thread at a time. Typical applications of this directive pair are reading input from the keyboard/a file or updating the value of a shared variable. The syntax is the following:

   !$omp critical
   ...
   !$omp end critical

When a thread reaches the beginning of a critical section, it waits there until no other thread is executing the code in the critical section.
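A minimal sketch of the CRITICAL construct protecting an update of a shared variable (the variable name is illustrative):

   program critical_demo
      use omp_lib
      implicit none
      integer :: nupdates
      nupdates = 0
      call omp_set_num_threads(4)
   !$omp parallel shared(nupdates)
   !$omp critical
      nupdates = nupdates + 1            ! only one thread at a time executes this update
   !$omp end critical
   !$omp end parallel
      print *, 'nupdates = ', nupdates   ! equals the number of threads (here 4)
   end program critical_demo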

The THREADPRIVATE construct

Sometimes it is of interest to have global variables whose values are specific to each thread. An example could be a variable called my_id, which stores the thread identification number of each thread: this number is different for each thread, but it is useful if its value is accessible from everywhere inside each thread and does not change from one parallel region to the next. When the program enters the first parallel region, a private copy of each variable marked as THREADPRIVATE is created for each thread.

   integer, save :: my_id            ! the variable must have the SAVE attribute
   !$omp threadprivate(my_id)

   !$omp parallel
   my_id = OMP_get_thread_num()      ! the thread number is assigned to my_id
   !$omp end parallel
   ...
   !$omp parallel
   ...
   !$omp end parallel

In this example, the variable my_id gets assigned the thread identification number of each thread during the first parallel region. In the second parallel region, my_id keeps the values assigned to it in the first parallel region, since it is THREADPRIVATE.

OpenMP runtime library overview

OpenMP Fortran library routines are external functions. Their names start with OMP_ and they usually have an integer or logical return type. These functions must be declared explicitly.

Name                    Functionality
omp_set_num_threads     Set the number of threads
omp_get_num_threads     Return the number of threads in the team
omp_get_max_threads     Return the maximum number of threads
omp_get_thread_num      Get the thread ID
omp_get_num_procs       Return the number of available processors
omp_in_parallel         Check whether inside a parallel region
omp_set_dynamic         Activate dynamic thread adjustment
omp_get_dynamic         Check for dynamic thread adjustment
omp_set_nested          Activate nested parallelism
omp_get_nested          Check for nested parallelism
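With most Fortran 90 compilers the simplest way to obtain these declarations is the omp_lib module, as in the sketch below (the commented-out external declaration is the explicit alternative; the subroutine name is illustrative):

   subroutine report_team
      use omp_lib                                  ! provides interfaces for the omp_ routines
      implicit none
   !  integer, external :: omp_get_num_threads    ! alternative: declare the function explicitly
   !$omp parallel
   !$omp master
      print *, 'threads in team: ', omp_get_num_threads()
   !$omp end master
   !$omp end parallel
   end subroutine report_team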

References

www.openmp.org
OpenMP Application Program Interface, Version 3.0, May 2008
Also various web resources and books.

Assignment 9 (five extra points)

Parallelize your version of the Sedov test problem (or the Sod shock tube problem) using OpenMP directives (see Nigel's lecture on hyperbolic equations, Assignment 6).
- Use a sufficiently high resolution so that the serial code runs for at least 1 minute.
- Use different numbers of threads (2, 4, max).
- Calculate the speedup for the different numbers of threads (2, 4, max) relative to the purely serial code (use time ./your_code in Linux to measure the run time of your code).

The report is due on 12.07.2012.