Lecture on Scientific Computing
Dr. Kersten Schmidt
Lecture 20
Technische Universität Berlin, Institut für Mathematik
Wintersemester 2014/2015
Syllabus

- Linear regression, Fast Fourier transform
- Modelling by partial differential equations (PDEs)
  - Maxwell, Helmholtz, Poisson, linear elasticity, Navier-Stokes equations
  - boundary value problems, eigenvalue problems
  - boundary conditions (Dirichlet, Neumann, Robin)
  - handling of infinite domains (waveguide, homogeneous exterior: DtN, PML)
  - boundary integral equations
- Computer-aided design (CAD)
- Mesh generators
- Space discretisation of PDEs
  - Finite difference method
  - Finite element method
  - Discontinuous Galerkin finite element method
- Solvers
  - Linear solvers (direct, iterative), preconditioners
  - Nonlinear solvers (Newton-Raphson iteration)
  - Eigenvalue solvers
- Parallelisation
  - Computer hardware (SIMD, MIMD: shared/distributed memory)
  - Programming in parallel: OpenMP, MPI
Shared memory computer

A process is an instance of a program that executes more or less autonomously on a physical processor. A thread is a sequence of instructions. Several threads may share resources (CPU, memory). The threads of a process share its instructions and its variables.
Shared-memory parallel programming model

Static process generation: the number of threads is fixed a priori by the programmer.

Dynamic process generation: fork/join parallelism. A master process forks slave processes/threads as needed in a parallel region. At the end of the parallel region the slave processes may be killed. The number of threads may vary from one parallel region to another.
OpenMP

OpenMP is an application programming interface that
- provides a parallel programming model for shared-memory and distributed shared-memory multiprocessors,
- extends programming languages (C/C++ and Fortran) by a set of compiler directives to express shared-memory parallelism (in C they are called pragmas),
- offers runtime library routines and environment variables that are used to examine and modify execution parameters.

There is a standard include file omp.h for C/C++ OpenMP programs. With gcc/g++, compile with the flag -fopenmp, e.g.

  gcc -fopenmp example.c -o example

OpenMP has become the de facto standard for parallelising applications for shared-memory multiprocessors. It is independent of the underlying hardware and operating system.
First OpenMP demo program in C: omp1st.c

  #include <stdio.h>
  #include <omp.h>

  int main() {
    #pragma omp parallel
    {
      int myid = omp_get_thread_num();
      int num = omp_get_num_threads();
      printf("Thread %d from %d is ready.\n", myid, num);
    }
    return 0;
  }

Calling

  % gcc -fopenmp omp1st.c -o omp1st
  % export OMP_NUM_THREADS=4
  % ./omp1st
  Thread 2 from 4 is ready.
  Thread 3 from 4 is ready.
  Thread 0 from 4 is ready.
  Thread 1 from 4 is ready.
Another OpenMP demo program in C: omp2nd.c

  #include <stdio.h>
  #include <omp.h>

  int main() {
    omp_set_num_threads(2);
    #pragma omp parallel
    {
      int myid = omp_get_thread_num();
      int num = omp_get_num_threads();
      printf("Thread %d from %d is ready.\n", myid, num);
    }
    printf("\n");
    omp_set_num_threads(3);
    #pragma omp parallel
    {
      int myid = omp_get_thread_num();
      int num = omp_get_num_threads();
      printf("Thread %d from %d is ready.\n", myid, num);
    }
    return 0;
  }

Calling

  % gcc -fopenmp omp2nd.c -o omp2nd
  % ./omp2nd
  Thread 0 from 2 is ready.
  Thread 1 from 2 is ready.

  Thread 2 from 3 is ready.
  Thread 0 from 3 is ready.
  Thread 1 from 3 is ready.
How is OpenMP typically used?

OpenMP is typically used to parallelise loops: find your most time-consuming loops and split them up between threads.

OpenMP parallel control structures that fork new threads:
- The parallel directive is used to create multiple threads that execute concurrently. It applies to a structured block, i.e. in C/C++ the code between { and }.
- The for directive is used to express loop-level parallelism.
An example SAXPY: y ← αx + y

1. The sequential program

  for(i = 0; i < N; ++i) {
    y[i] = alpha * x[i] + y[i];
  }

2. OpenMP parallel region

  #pragma omp parallel
  {
    int id, i, num, istart, iend;
    id = omp_get_thread_num();
    num = omp_get_num_threads();
    istart = id * N / num;
    iend = (id + 1) * N / num;
    for(i = istart; i < iend; ++i) {
      y[i] = alpha * x[i] + y[i];
    }
  }

3. OpenMP parallel region combined with a for-directive

  #pragma omp parallel
  #pragma omp for
  for(i = 0; i < N; ++i) {
    y[i] = alpha * x[i] + y[i];
  }

or, in short,

  #pragma omp parallel for
  for(i = 0; i < N; ++i) {
    y[i] = alpha * x[i] + y[i];
  }

A complete program built around variant 3 is sketched below.
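For reference, a minimal self-contained sketch of variant 3; the file name saxpy_omp.c and the values chosen for N and alpha are illustrative assumptions, not from the lecture:

  /* saxpy_omp.c -- minimal sketch of variant 3; N and alpha
     are arbitrary illustrative choices. */
  #include <stdio.h>
  #include <stdlib.h>

  int main() {
    const long N = 1000000;
    const double alpha = 2.0;
    double *x = malloc(N * sizeof(double));
    double *y = malloc(N * sizeof(double));
    long i;

    /* initialise with arbitrary values */
    for(i = 0; i < N; ++i) { x[i] = 1.0; y[i] = (double)i; }

    /* the combined parallel for directive splits the iterations
       between the threads; the loop variable i is made private
       automatically */
    #pragma omp parallel for
    for(i = 0; i < N; ++i) {
      y[i] = alpha * x[i] + y[i];
    }

    printf("y[N-1] = %f\n", y[N-1]);
    free(x);
    free(y);
    return 0;
  }

Compile and run as before:

  % gcc -fopenmp saxpy_omp.c -o saxpy_omp
  % ./saxpy_omp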
OpenMP communication

The threads share all global variables, variables on the stack, and dynamically allocated variables (on the heap), except if variables are marked as private. The (possibly different) values for each thread are then stored at multiple locations.

OpenMP synchronization

Threads can easily communicate with each other through reads and writes of shared variables. However, this has to happen at the right time. Example: two threads should each add some value to a shared variable. Two forms of process synchronization or coordination are
- mutual exclusion: use the critical directive to indicate that threads have to read and write variables one after the other;
- event synchronization: use the barrier directive to define a point where each thread waits for all other threads to arrive.

A minimal sketch of both forms follows below.
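The sketch assumes two threads that each add one to a shared counter; the file name sync_demo.c and the variable names are illustrative, not from the lecture:

  /* sync_demo.c -- minimal sketch of critical and barrier. */
  #include <stdio.h>
  #include <omp.h>

  int main() {
    int counter = 0;            /* shared by all threads */
    omp_set_num_threads(2);
    #pragma omp parallel
    {
      /* mutual exclusion: only one thread at a time may
         read and write counter */
      #pragma omp critical
      {
        counter = counter + 1;
      }
      /* event synchronization: every thread waits here until
         all threads have arrived */
      #pragma omp barrier
      if (omp_get_thread_num() == 0)
        printf("counter = %d\n", counter);  /* prints 2 with two threads */
    }
    return 0;
  }

Without the critical directive the two updates of counter could interleave; without the barrier, thread 0 could print before thread 1 has added its contribution.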
Example: dot product: dotproduct_i.c

  #include <stdio.h>
  #include <omp.h>
  #define NUM_THREADS 2

  int main() {
    long N = 100;
    long i;
    double sum = 0.0, x[N], y[N];
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel for
    for(i = 0; i < N; ++i) {
      y[i] = 1.0;
      x[i] = (double)i;
    }
    #pragma omp parallel for
    for(i = 0; i < N; ++i)
      sum = sum + x[i]*y[i];
    printf("%12.0f?= %ld\n", sum, (N-1)*N/2);
    return 0;
  }

Calling

  % gcc -fopenmp dotproduct_i.c -o dotproduct_i
  % ./dotproduct_i
  3240?= 4950
  % ./dotproduct_i
  4950?= 4950
  % ./dotproduct_i
  2775?= 4950
  % ./dotproduct_i
  4950?= 4950
  % ./dotproduct_i
  4950?= 4950
  % ./dotproduct_i
  4950?= 4950
  % ./dotproduct_i
  4950?= 4950

What is the problem here? Why do we obtain different results if we run the code several times?

Unintended sharing of variables (here sum) can lead to race conditions: the results (may) change as the threads are scheduled differently.
Example: dot product: dotproduct_ii.c

  #include <stdio.h>
  #include <omp.h>
  #define NUM_THREADS 2

  int main() {
    long N = 500000;
    long i;
    double sum = 0.0, x[N], y[N];
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel for
    for(i = 0; i < N; ++i) {
      y[i] = 1.0;
      x[i] = (double)i;
    }
    #pragma omp parallel for
    for(i = 0; i < N; ++i) {
      #pragma omp critical
      {
        sum = sum + x[i]*y[i];
      }
    }
    printf("%12.0f?= %ld\n", sum, (N-1)*N/2);
    return 0;
  }

Right result? Yes: the critical directive serialises the updates of sum, so the race condition is gone. But there is now one synchronization per loop iteration, which typically makes this version slower than the sequential program.
Example: dot product: dotproduct_iii.c

  #include <stdio.h>
  #include <omp.h>
  #define NUM_THREADS 2

  int main() {
    long N = 500000, i;
    double sum = 0.0, local_sum, x[N], y[N];
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel for
    for(i = 0; i < N; ++i) {
      y[i] = 1.0;
      x[i] = (double)i;
    }
    #pragma omp parallel private(local_sum)
    {
      local_sum = 0.0;
      #pragma omp for
      for(i = 0; i < N; ++i) {
        local_sum = local_sum + x[i]*y[i];
      }
      #pragma omp critical
      {
        sum = sum + local_sum;
      }
    }
    printf("%12.0f?= %ld\n", sum, (N-1)*N/2);
    return 0;
  }

This is already much better: each thread accumulates into its own private local_sum, so there are now only NUM_THREADS points of synchronization.
Example: dot product: dotproduct_iv.c

  #include <stdio.h>
  #include <omp.h>
  #define NUM_THREADS 2

  int main() {
    long N = 500000, i;
    double sum = 0.0, x[N], y[N];
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel for
    for(i = 0; i < N; ++i) {
      y[i] = 1.0;
      x[i] = (double)i;
    }
    #pragma omp parallel for reduction(+ : sum)
    for(i = 0; i < N; ++i) {
      sum = sum + x[i]*y[i];
    }
    printf("%12.0f?= %ld\n", sum, (N-1)*N/2);
    return 0;
  }

The most elegant way to deal with the above problem is to give the variable sum the reduction attribute. The compiler implicitly creates a local copy of sum for each thread; at the end of the loop, the local copies are combined by the reduction operation in an optimal way.

Remark: Reduction operations are possible with the operators +, -, *, &, |, ^, && and ||.
Data dependence

Whenever one statement in a program reads or writes a memory location, another statement reads or writes the same memory location, and at least one of the two statements writes the location, then there is a data dependence on that memory location between the two statements.

A data dependence is loop-carried if the two statements involved occur in different iterations of a loop.

Removing data dependencies in algorithms, examples:
- copy values into auxiliary variables (arrays), as sketched below;
- change loops such that the outer loop is the longest one and guard the inner loop with a critical directive.
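A minimal sketch of the first technique (an illustration of my own, not from the lecture): the loop below carries a dependence because iteration i reads a[i+1], which iteration i+1 overwrites; copying the values into an auxiliary array a_old makes the iterations independent.

  /* dep_demo.c -- illustrative sketch: removing a loop-carried
     dependence by copying into an auxiliary array. */
  #include <stdio.h>
  #include <string.h>
  #define N 1000

  int main() {
    static double a[N], a_old[N];
    long i;
    for(i = 0; i < N; ++i) a[i] = (double)i;

    /* Sequential loop: iteration i reads a[i+1], which iteration
       i+1 overwrites -- a loop-carried dependence. Run in parallel
       as it stands, a thread may read an already updated value:

         for(i = 0; i < N-1; ++i) a[i] = a[i+1] + 1.0;          */

    /* Remedy: copy the values into the auxiliary array a_old and
       read only from the copy. Each iteration then writes a[i] and
       reads a_old, so the iterations are independent and the loop
       can safely be split between threads. */
    memcpy(a_old, a, N * sizeof(double));
    #pragma omp parallel for
    for(i = 0; i < N-1; ++i)
      a[i] = a_old[i+1] + 1.0;

    printf("a[0] = %f (expected 2)\n", a[0]);
    return 0;
  }

The copy costs extra memory and one pass over the array, but it yields exactly the result of the sequential loop, independently of the thread schedule.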