Introduction to High Performance Computing and Optimization


1 Institut für Numerische Mathematik und Optimierung
Introduction to High Performance Computing and Optimization
Oliver Ernst
Audience: 1./3. CMS, 5./7./9. Mm, doctoral students
Wintersemester 2012/13

2 Contents
1. Introduction
2. Processor Architecture
3. Optimization of Serial Code
   3.1 Performance Measurement
   3.2 Optimization Guidelines
   3.3 Compiler-Aided Optimization
   3.4 Combine example
   3.5 Further Optimization Issues
4. Parallel Computing
   4.1 Introduction
   4.2 Scalability
   4.3 Parallel Architectures
   4.4 Networks
5. OpenMP Programming

3 Contents
1. Introduction
2. Processor Architecture
3. Optimization of Serial Code
4. Parallel Computing
5. OpenMP Programming

4 Background
Application Programming Interface (API) for shared-memory programming; supports C/C++ and Fortran on all architectures; implemented by many compilers; managed by the OpenMP Architecture Review Board (ARB).
Initiated by vendors of SMP systems in the late 1990s to provide portable language extensions for programming SMP architectures.
Introduced 1997; the current specification is Version 3.1 (July 2011); Version 4.0 exists as a release candidate (November 2012). openmp.org
Feature set: the OpenMP API comprises a set of compiler directives, runtime library routines and environment variables. These allow programmers to create teams of threads for parallel execution, specify how to share work among the members of a team, declare both shared and private variables, and synchronize threads and enable them to perform certain operations exclusively (i.e., without interference by other threads).

5 Literature
R. Chandra et al. Parallel Programming in OpenMP. Morgan Kaufmann, 2000.
B. Chapman, G. Jost and R. van der Pas. Using OpenMP. MIT Press, 2007.
OpenMP standard specification (currently 3.1), available from openmp.org.
Manual of the GNU OpenMP implementation libgomp.

6 Basic Model
Fork/join model: an initial thread (serial region) forks a team of threads for (one or more) parallel regions; at the join, execution continues with the initial thread (serial region).
Work is shared among cooperating threads.
Shared and private memory.
OpenMP handles many basic system tasks (as opposed to, say, POSIX threads or Intel's Threading Building Blocks, TBB).
Incremental parallelization of given serial code is possible.
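
To make the fork/join model concrete, here is a minimal sketch (not from the slides; compile e.g. with gcc -fopenmp):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("serial region: initial thread only\n");
#pragma omp parallel   /* fork: a team of threads executes this block */
    {
        printf("parallel region: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                  /* join: implicit barrier, back to the initial thread */
    printf("serial region: initial thread only\n");
    return 0;
}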

7 Processes and threads
Process: an operating-system term (the older of the two); essentially the code of a program combined with state information.
A process can consist of multiple threads; these share a common address space. Threads have their own program counter (instruction pointer) and stack.
Threads are lightweight processes in the sense that they can be started (spawned) with much less overhead than starting (forking) a new process. Thread communication (via the shared memory) is cheaper than inter-process communication.
In OpenMP: each thread has a unique thread ID, which can be queried by the API function omp_get_thread_num(). The number of parallel threads is set by an environment variable:

export OMP_NUM_THREADS=4
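
A run-time usage sketch (assuming the example after slide 6 was compiled with gcc -fopenmp -o hello hello.c; the binary name is my choice):

export OMP_NUM_THREADS=4
./hello     # the parallel region prints four lines, threads 0..3

export OMP_NUM_THREADS=2
./hello     # now only threads 0 and 1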

8 Data Scoping
By default: variables existing before a parallel region still exist inside it and are shared by all threads.
Three ways to make variables private:
(1) Apply the private clause to an existing variable, creating a separate instance of this variable for each thread.
(2) The index variable of a work-sharing loop.
(3) Local variables in a subroutine called from a parallel region are private to each thread. Note: local variables with C storage class static remain shared.
In C: the private property applies to the entire structured block and thus also to variables declared there:

#pragma omp parallel
{
    int bstart, bend, blen, numth, tid, i;
    ...                       /* calculate loop boundaries */
    for (i = bstart; i <= bend; i++)
        a[i] = b[i] + c[i];
}
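
A small sketch of way (1), the private clause (my example, not from the slides):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int tid = -1;                    /* exists before the parallel region */
#pragma omp parallel private(tid)    /* each thread gets its own instance */
    {
        tid = omp_get_thread_num();  /* private copy; uninitialized on entry */
        printf("hello from thread %d\n", tid);
    }
    /* the private copies end with the region; use firstprivate/lastprivate
       to copy values into/out of a construct */
    return 0;
}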

9 Directives
Include the header omp.h for function prototypes.
#pragma directives control the actions of the compiler without affecting the program as a whole. OpenMP directives in C/C++ are inserted via #pragma omp <directive [clause]>.
Conditional compilation: activating the compiler's OpenMP switch sets a preprocessor macro:

#ifdef _OPENMP
... do something
#endif

The directive

#pragma omp parallel
<structured block>

makes the structured block a parallel region:
All code executed between start and end of this region is executed by all threads. This includes subroutine calls within the region (unless explicitly sequentialized).
Local variables inside the block are automatically private to each thread.

10 The loop construct

#pragma omp for
<for loop>

Causes the iterations of the loop immediately following to be executed in parallel. Restricted to loops for which the number of iterations is known a priori.

#pragma omp for
for (i = 0; i < n; i++)
    printf("Thread %d executes loop iteration %d\n", omp_get_thread_num(), i);

Example of a parallel for loop. Note: this only executes in parallel if enclosed in a parallel region. Otherwise use the combined directive:

#pragma omp parallel for
for (i = 0; i < n; i++)
    printf("Thread %d executes loop iteration %d\n", omp_get_thread_num(), i);

11 The loop construct
Example output for n = 9 and 4 threads:

Thread 0 executes loop iteration 0
Thread 0 executes loop iteration 1
Thread 0 executes loop iteration 2
Thread 3 executes loop iteration 7
Thread 3 executes loop iteration 8
Thread 2 executes loop iteration 5
Thread 2 executes loop iteration 6
Thread 1 executes loop iteration 3
Thread 1 executes loop iteration 4

Note: here no assignment of iterations to threads was specified; in this case the decision is left to the implementation.

12 Example Quadrature

$$\pi = \int_0^1 \frac{4}{1+x^2}\,dx \;\approx\; \frac{1}{n}\sum_{j=1}^{n} f(x_j), \qquad x_j = \frac{j-\tfrac12}{n}, \qquad f(x) = \frac{4}{1+x^2}.$$

[Figure: plot of f(x) = 4/(1+x^2) on [0,1]]

13 Example Quadrature

double rect(int N)
{
    double h, I = 0.0, x;
    int j;

    h = 1.0 / N;
    for (j = 0; j < N; j++) {
        x = (j + 0.5) * h;
        I += 4.0 / (1.0 + x * x);
    }
    return I * h;
}

C code for approximating π by quadrature.
Each loop iteration is independent. Divide this work across several threads.

14 Example Quadrature

#include <omp.h>

double rect(int N)
{
    double h, I = 0.0, x, sum;
    int j;

    h = 1.0 / N;
#pragma omp parallel private(x, sum)
    {
        sum = 0.0;
#pragma omp for
        for (j = 0; j < N; j++) {
            x = (j + 0.5) * h;
            sum += 4.0 / (1.0 + x * x);
        }
#pragma omp critical
        I = I + h * sum;
    }
    return I;
}

C code for approximating π by quadrature with OpenMP directives.

15 Clauses
Some OpenMP directives also support clauses. In the quadrature example, private(x, sum) appears as a clause to the parallel directive; it declares the variables x and sum to be private to each thread.
Clauses supported by the loop construct (a sketch of firstprivate/lastprivate follows below):
private(<list>)
firstprivate(<list>): pre-initialize the private variables with the value of the shared variable of the same name from before the parallel region.
lastprivate(<list>): make the value of the private variable from the (sequentially) last iteration available after the parallel region.
reduction(<operator>:<list>): applies a reduction operator to the variables after the parallel region.
ordered: force execution in loop order.
schedule(<kind>[,<chunk-size>]): allows control over the assignment of iterations to threads.
nowait: omit the implied barrier at the end of the worksharing construct.
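
A sketch of firstprivate and lastprivate (my example, not from the slides):

#include <stdio.h>

int main(void)
{
    int i, x = 10, last = -1;
#pragma omp parallel for firstprivate(x) lastprivate(last)
    for (i = 0; i < 100; i++)
        last = x + i;        /* x is pre-initialized to 10 in every thread */
    /* lastprivate: after the loop, last holds the value from the
       sequentially final iteration (i == 99), hence 109 */
    printf("last = %d\n", last);
    return 0;
}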

16 Synchronization: critical construct
When several threads write to the same shared variable, a race condition results if no serialization is imposed. OpenMP supplies the latter with the critical construct.

#pragma omp critical
<structured block>

The structured block is declared a critical section of code, which may be entered by only one thread at a time.
Badly arranged critical regions can lead to deadlocks: threads waiting for events which never occur. Reason: when a thread encounters a critical directive inside a critical region it will block forever.
Solution: names for critical directives. Only critical sections with the same name exclude each other.

#pragma omp critical(<name>)
<structured block>

18 Synchronization: critical construct
Example:

#pragma omp parallel for private(x)
for (i = 1; i <= N; i++) {
    x = sin(2 * M_PI * i / N);
#pragma omp critical(psum)
    sum += func(x);
}

double func(double v)
{
    double r;
#pragma omp critical(prand)
    r = v + random_func();   /* random_func keeps hidden state of its own */
    return r;
}

19 Synchronization: barrier construct

#pragma omp barrier

Causes a thread to wait until all other threads of the team have reached this point in the code.
Applies to the innermost enclosing parallel region.
Many OpenMP constructs contain an implied barrier (added by the compiler).
Potential for deadlocks if not all threads reach this statement: the sequence of worksharing regions and barrier regions must be the same for all threads.
Most often used to avoid a data race condition: a barrier between a write and a subsequent read from a shared variable (a sketch follows below).
Note: use only where necessary, as barriers cause synchronization overhead.
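
A minimal sketch (not from the slides) of the write-then-read pattern mentioned above:

#include <stdio.h>
#include <omp.h>

double shared_val;               /* shared by all threads */

int main(void)
{
#pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
            shared_val = 42.0;   /* one thread writes ...            */
#pragma omp barrier              /* ... all threads wait here ...    */
        printf("thread %d reads %g\n",   /* ... then everyone reads  */
               omp_get_thread_num(), shared_val);
    }
    return 0;
}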

20 Reduction clause

#pragma omp parallel for reduction(+:sum)
for (i = 0; i < n; i++)
    sum += a[i];

Associative and commutative mathematical operations like + and * are used to accumulate private variables into one shared variable.
The programmer specifies the operation and the variable which holds the result; this variable is shared after exiting the parallel region.
Usually faster than a manual implementation (see the rewrite of the quadrature example below).
Automatic initialization of the private reduction variables with appropriate values (e.g. 0 for +, 1 for *). (Note: danger of producing incorrect serial code!)
The set of supported operations differs slightly between C/C++ and Fortran.
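
For comparison, a hedged rewrite (mine, not from the slides) of the quadrature routine from slide 14 using a reduction instead of the manual critical section:

#include <omp.h>

double rect(int N)
{
    double h = 1.0 / N, x, sum = 0.0;
    int j;

#pragma omp parallel for private(x) reduction(+:sum)
    for (j = 0; j < N; j++) {
        x = (j + 0.5) * h;
        sum += 4.0 / (1.0 + x * x);
    }
    return h * sum;   /* approximates pi */
}

The per-thread partial sums and the critical section disappear; the runtime accumulates the private copies of sum at the end of the loop.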

21 Schedule clause

#pragma omp parallel for schedule(static)
for (i = 0; i < n; i++)
    a[i] = calculate(i);

Used to assign loop iterations to threads.
Syntax: schedule(<kind>[,<chunk-size>])
Granularity of workload distribution is determined by the expression chunk-size. Chunk: contiguous, nonempty subrange of the iteration index.
4 kinds:
static: chunks assigned cyclically in order of thread ID.
dynamic: chunks assigned as threads request them; default chunk-size is 1.
guided: as in dynamic, but chunk-size proportional to the remaining number of iterations divided by the number of threads.
runtime: scheduling determined at run time as set in the OMP_SCHEDULE environment variable.

22 Schedule clause
Example: different assignments of 20 iterations of a loop to 3 threads (T0, T1, T2) using the schedule clause.
[Figure: iteration-to-thread maps for STATIC, STATIC,3, DYNAMIC[,1], DYNAMIC,3 and GUIDED[,1]; source: Hager & Wellein]

23 Single construct

#pragma omp single
<structured block>

Specifies that the structured block immediately following should be executed by one thread only.
Which thread is not specified and will vary across runs (and across single constructs).

24 Single construct (example)

#pragma omp parallel shared(a, b) private(i)
{
#pragma omp single
    {
        a = 10;
        printf("Single construct executed by thread %d\n", omp_get_thread_num());
    } /* barrier automatically inserted here */
#pragma omp for
    for (i = 0; i < n; i++)
        b[i] = a;
}

25 Task construct

#pragma omp task [<clause>[[,] <clause>]...]
<structured block>

Added with OpenMP standard 3.0 to parallelize collections of tasks not easily expressed as a loop over an index set.
Task construct: task directive plus structured block.
In tasking terms, a thread encountering a parallel construct packages up a set of implicit tasks, one per thread; a team of threads is created, and each thread in the team is assigned one of the tasks. An implicit barrier holds the master thread until all implicit tasks are completed.

26 Task construct: example

#pragma omp parallel
{
#pragma omp single private(p)
    {
        p = listhead;
        while (p) {
#pragma omp task
            process(p);   /* p is firstprivate by default */
            p = next(p);
        }
    }
}

27 Conditional compilation
Issue: different code depending on whether OpenMP is active or not.
OpenMP directives are automatically ignored by compilers without OpenMP support.
Calls to OpenMP API functions can be masked out with the C/C++ preprocessor symbol _OPENMP.

28 Memory consistency

myid = 0;
numthreads = 1;
#ifdef _OPENMP
#pragma omp parallel private(myid)
{
    myid = omp_get_thread_num();
#pragma omp single
    numthreads = omp_get_num_threads();
#pragma omp critical
    printf("Parallel program, this is thread %d of %d.\n", myid, numthreads);
}
#else
printf("Serial program.\n");
#endif

The call to omp_get_num_threads() is placed in a single region since numthreads is shared.
For correct values to be printed, all threads must wait until the thread executing the single region has completed (the implicit barrier after single ensures this).
Variables held in registers are written out so that cache coherency can guarantee memory consistency (cf. the OpenMP flush directive).

29 Thread safety
The printf statement above is serialized (protected by a critical region) to prevent the output streams of multiple threads from interfering.
General rule: I/O operations, general OS functionality, and common library functions should be serialized because they may not be thread safe.
Prominent example: the rand() function from the C library, as it uses a static variable to store its hidden state (the seed).
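
A hedged sketch (not from the slides) of serializing such calls; alternatively, a reentrant variant such as POSIX rand_r() avoids the lock entirely:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int i, r;
#pragma omp parallel for private(r)
    for (i = 0; i < 8; i++) {
#pragma omp critical(rng)     /* rand() keeps hidden static state: serialize */
        r = rand();
#pragma omp critical(io)      /* serialize the shared output stream as well */
        printf("iteration %d drew %d\n", i, r);
    }
    return 0;
}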

30 Affinity
The OpenMP standard provides no way to bind threads to cores.
No provisions for locality constraints.
Can't rely on the OS to do this well.
Remedy: OS-level tools such as likwid.
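
For example (a sketch; exact options depend on the installed likwid version), likwid-pin can pin the threads of an OpenMP binary to cores:

likwid-pin -c 0-3 ./a.out    # run a.out with its 4 threads pinned to cores 0-3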

31 Case study: Jacobi relaxation
One of the oldest iterative methods for solving a linear system of equations $Ax = b$, $A \in \mathbb{R}^{N \times N}$, $b \in \mathbb{R}^N$, is Jacobi's method: given an approximation $x_m \approx A^{-1}b$, a new approximation $x_{m+1}$ is obtained by updating each component of $x_m$ by assuming all remaining components are correct: for the $i$-th equation

$$a_{i,1}[x_m]_1 + a_{i,2}[x_m]_2 + \cdots + a_{i,N}[x_m]_N = [b]_i, \qquad 1 \le i \le N,$$

one obtains

$$[x_{m+1}]_i = \Bigl([b]_i - \sum_{\substack{j=1\\ j\ne i}}^{N} a_{i,j}[x_m]_j\Bigr)\Big/ a_{i,i}.$$

In matrix notation, setting $D := \operatorname{diag}(A)$, the complete update reads

$$x_{m+1} = x_m + D^{-1}(b - Ax_m).$$

32 Case study: Jacobi relaxation
When used to solve linear systems arising from the discretization of PDEs, such iterations are called relaxation methods.
The 5-point FD discretization of $-\Delta u = f$ on $\Omega = (0,1)^2$ with $u|_{\partial\Omega} = 0$, on a uniform grid $(x_i, y_j)$ of mesh width $h = 1/(n+1)$, leads to a system $Au = f$ with $f_{i,j} = f(x_i, y_j)$, $u_{i,j} \approx u(x_i, y_j)$ and

$$[Au]_{i,j} = \frac{4u_{i,j} - u_{i-1,j} - u_{i+1,j} - u_{i,j-1} - u_{i,j+1}}{h^2}.$$

[Figure: uniform grid with 5-point stencil; axes x, y, mesh width h]

33 Case study: Jacobi relaxation
One step of Jacobi relaxation thus updates each component $[u_m]_{i,j}$ of $u_m$ as

$$[u_{m+1}]_{i,j} = \frac{h^2 [f]_{i,j} + [u_m]_{i-1,j} + [u_m]_{i+1,j} + [u_m]_{i,j-1} + [u_m]_{i,j+1}}{4}.$$

In MATLAB:

G = numgrid('S', n+2); h = 1/(n+1);
A = 1/h^2 * delsq(G); f = ones(n^2, 1);
u_ex = A\f;

u = zeros(size(f)); r = f - A*u;
D = spdiags(diag(A), 0, n^2, n^2);
for j = 1:100
    u = u + D\r;
    r = f - A*u;
    errnorm(j) = norm(u_ex - u)/norm(u_ex);
end

34 Case study: Jacobi relaxation
Not the fastest linear solver
[Figure: relative error vs. iteration m for N=10, N=20, ...]
but still a common computational pattern (stencil-based computation, regular mesh).

35 Case study: Jacobi relaxation

while (maxdelta > eps && it < 5) {
    maxdelta = 0.; it++;
#pragma omp parallel private(i, k, localmax, tmp)
    {
        localmax = 0.;   /* private copy must be initialized inside the region */
#pragma omp for
        for (i = 1; i <= N; i++) {
            for (k = 1; k <= N; k++) {
                /* four flops, one store, four loads */
                phi[t1][i][k] = (phi[t0][i+1][k] + phi[t0][i-1][k] +
                                 phi[t0][i][k+1] + phi[t0][i][k-1]) * 0.25;
                if (phi[t1][i][k] < phi[t0][i][k])
                    tmp = phi[t0][i][k] - phi[t1][i][k];
                else
                    tmp = phi[t1][i][k] - phi[t0][i][k];
                if (localmax < tmp)
                    localmax = tmp;
                /* localmax = fmax(localmax, fabs(phi[t1][i][k] - phi[t0][i][k])); */
            }
        }
#pragma omp flush(maxdelta)
        if (localmax > maxdelta) {
#pragma omp critical
            if (localmax > maxdelta)
                maxdelta = localmax;
        }
    }
    i = t0; t0 = t1; t1 = i;   /* t0 ^= t1; t1 ^= t0; t0 ^= t1;  swap arrays */
}

36 Case study: Jacobi relaxation
Hager and Wellein's timings:
[Figure: performance in MLUPs/sec vs. N for 1 thread; 2 threads, 1 socket; 2 threads, 2 sockets; and 4 threads. Inset: dual-socket node with 4 cores (32k L1D each), two shared 4 MB L2 caches (8 MB total), chipset, memory]

37 Case study: Jacobi relaxation
Our timings:
[Figure: 2D Jacobi OpenMP on klio; MLUP/s vs. N for 1 thread; 2 threads, 1 socket; 2 threads, 2 sockets; 3 threads; and 4 threads]

38 Case study: Sparse matrix-vector multiplication
Many scientific computing problems (PDEs, optimization, graph analysis) feature matrices which are sparse, i.e., contain only a small number of nonzero entries in each row/column.
Exploiting this fact requires storing these matrices in data structures which hold (essentially) only the nonzero entries, and developing algorithms for matrix operations on these data structures.
The resulting sparse matrix formats dramatically reduce operation and storage complexity for operations with sparse matrices.
Different formats have been developed reflecting specific application properties and computer hardware behavior. We compare two with regard to OpenMP parallelization on current SMP systems.

40 Case study: Sparse matrix-vector multiplication
In CRS (compressed row storage), also known as CSR (compressed sparse row):
a (linear) array val of floating point numbers stores the nonzero matrix entries row-wise;
a corresponding integer array col_idx contains the column indices of the nonzero matrix entries;
a second integer array row_ptr contains, for each row, the index in val and col_idx of this row's first entry.
A possible C container for this format is sketched below.
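
A minimal sketch of such a container (the array names follow the slides; the struct name and fields n, nnz are my additions):

/* A minimal CRS container; n = number of rows, nnz = number of nonzeros. */
typedef struct {
    int     n;        /* number of rows                      */
    int     nnz;      /* number of stored (nonzero) entries  */
    double *val;      /* nonzero values, row by row   [nnz]  */
    int    *col_idx;  /* column index of each value   [nnz]  */
    int    *row_ptr;  /* start of each row in val     [n+1]  */
} crs_matrix;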

41 Case study: Sparse matrix-vector multiplication
Example:
[Slide shows a small sparse matrix A together with its CRS representation, i.e., the contents of the arrays val, row_ptr and col_idx]
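
Since the slide's matrix did not survive extraction, here is a substitute worked example (my own, not the original): for

A = [ 1 0 2
      0 3 0
      4 0 5 ]

with 1-based indexing, the CRS arrays are

val     = [ 1 2 3 4 5 ]
col_idx = [ 1 3 2 1 3 ]
row_ptr = [ 1 3 4 6 ]

so row i occupies positions row_ptr[i] .. row_ptr[i+1]-1 of val and col_idx, matching the index convention of the kernel on the next slide.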

42 Case study: Sparse matrix-vector multiplication
CRS sparse matrix-vector product y ← Ax (n denotes the number of rows):

for (i = 1; i <= n; i++)
    for (j = row_ptr[i-1]; j < row_ptr[i]; j++)
        y[i-1] += val[j-1] * x[col_idx[j-1] - 1];

Long outer loop (over the number of rows n).
Short inner loop (over the nonzeros in each row) relative to pipeline length.
y accessed sequentially, one load per entry/cache line. Same holds for val.
Indirect addressing of x; only a problem if the nonzeros are not clustered around the diagonal. (This can be achieved by applying suitable reordering techniques to the matrix initially.)
Favorable ratio of data movement to arithmetic.

43 Case study: Sparse matrix-vector multiplication
Another popular sparse matrix format is JDS (jagged diagonal storage), which is constructed as follows:
(1) All zero entries are eliminated.
(2) The remaining (nonzero) entries in each row are shifted to the left.
(3) The rows are reordered according to decreasing length.
(4) An integer array perm records the associated permutation.
(5) The resulting (dense) columns are arranged consecutively in a linear array val. (These are the jagged diagonals, as they traverse the matrix in a diagonal-like fashion.)
(6) For each nonzero entry in val, the (original) column index is stored in an integer array col_idx.
(7) The original column indices within each jagged diagonal are also permuted according to the permutation represented in perm. This allows the same ordering of the input and result vectors.
(8) An integer array jd_ptr records the beginning of each jagged diagonal.

44 Case study: Sparse matrix-vector multiplication
[Figure: JDS representation of the example matrix, showing the arrays perm, val, the original column indices, col_idx and jd_ptr]

45 Case study: Sparse matrix-vector multiplication
JDS sparse matrix-vector product y ← Ax (nd denotes the number of jagged diagonals):

for (d = 1; d <= nd; d++) {
    diaglen = jd_ptr[d] - jd_ptr[d-1];
    offset  = jd_ptr[d-1] - 1;
    for (i = 1; i <= diaglen; i++)
        y[i-1] += val[offset+i-1] * x[col_idx[offset+i-1] - 1];
}

Long inner loop (over the length of a diagonal), better for pipelining. Note: no dependencies in the inner loop.
Short outer loop (over the number of jagged diagonals).
Multiple loads of the result vector y.
Sequential access of val.
Indirect accessing of the input vector x.
Favorable access pattern if the diagonals are reasonably straight.

46 Case study: Sparse matrix-vector multiplication
We optimize JDS sparse matrix-vector multiplication by loop unrolling and loop fusion (sometimes called unroll-and-jam).
Problem: unrolling and fusion require the inner loop length to be independent of the outer loop index, but jagged diagonals may have different lengths.
Solution: loop peeling. For m-way unrolling: cut chunks of m diagonals to a uniform length, leaving up to m-1 partial diagonals for separate treatment.

49 Case study: Sparse matrix-vector multiplication
JDS sparse matrix-vector product y ← Ax with 2-way unroll and jam:

for (d = 1; d <= nd; d += 2) {
    diaglen = min(jd_ptr[d] - jd_ptr[d-1], jd_ptr[d+1] - jd_ptr[d]);
    offset1 = jd_ptr[d-1] - 1;
    offset2 = jd_ptr[d]   - 1;
    for (i = 1; i <= diaglen; i++) {
        y[i-1] += val[offset1+i-1] * x[col_idx[offset1+i-1] - 1];
        y[i-1] += val[offset2+i-1] * x[col_idx[offset2+i-1] - 1];
    }
    /* peeled-off iterations of the longer diagonal d */
    for (i = diaglen + 1; i <= jd_ptr[d] - jd_ptr[d-1]; i++)
        y[i-1] += val[offset1+i-1] * x[col_idx[offset1+i-1] - 1];
}

50 Case study: Sparse matrix-vector multiplication
For large m the number of registers becomes a bottleneck. Solution: add blocking.
JDS sparse matrix-vector product y ← Ax with loop blocking (block size b):

/* loop over blocks */
for (ib = 1; ib <= n; ib += b) {
    block_start = ib;
    block_end   = min(ib + b - 1, n);
    /* loop over diagonals in one block */
    for (d = 1; d <= nd; d++) {
        diaglen = jd_ptr[d] - jd_ptr[d-1];
        offset  = jd_ptr[d-1] - 1;
        if (diaglen >= block_start)
            /* standard JDS mv kernel */
            for (i = block_start; i <= min(block_end, diaglen); i++)
                y[i-1] += val[offset+i-1] * x[col_idx[offset+i-1] - 1];
    }
}

51 Case study: Sparse matrix-vector multiplication
The parallelization of sparse matrix-vector multiplication in CRS format is trivial:

#pragma omp parallel for private(j)
for (i = 1; i <= n; i++)
    for (j = row_ptr[i-1]; j < row_ptr[i]; j++)
        y[i-1] += val[j-1] * x[col_idx[j-1] - 1];

OpenMP overhead is amortized over the long outer loop.
Possible load imbalance if longer rows occur in a clustered arrangement. In this case a suitable choice of schedule options (dynamic, guided) is needed.

52 Case study: Sparse matrix-vector multiplication
The simple JDS algorithm is also easily parallelized by sharing the inner loop:

#pragma omp parallel private(d, diaglen, offset)
for (d = 1; d <= nd; d++) {
    diaglen = jd_ptr[d] - jd_ptr[d-1];
    offset  = jd_ptr[d-1] - 1;
#pragma omp for
    for (i = 1; i <= diaglen; i++)
        y[i-1] += val[offset+i-1] * x[col_idx[offset+i-1] - 1];
}

OpenMP overhead is amortized over the long inner loop.
No load imbalance since all inner loop iterations contain the same amount of work.
The only problem is the bad serial performance of JDS matrix-vector multiplication.

53 Case study: Sparse matrix-vector multiplication
The blocked JDS algorithm is parallelized by sharing the outer loop over blocks:

#pragma omp parallel for private(block_start, block_end, i, d, \
                                 diaglen, offset)
for (ib = 1; ib <= n; ib += b) {
    block_start = ib;
    block_end   = min(ib + b - 1, n);
    for (d = 1; d <= nd; d++) {
        diaglen = jd_ptr[d] - jd_ptr[d-1];
        offset  = jd_ptr[d-1] - 1;
        if (diaglen >= block_start)
            for (i = block_start; i <= min(block_end, diaglen); i++)
                y[i-1] += val[offset+i-1] * x[col_idx[offset+i-1] - 1];
    }
}

Even less overhead since the parallel for directive surrounds the outer loop.
More potential for load imbalance as the matrix rows are sorted by size; static scheduling appropriate.

54 Case study: Sparse matrix-vector multiplication
[Figure: s3dkt3m2 sparse MVM benchmark on klio; MFlops/s for CRS and JDS at thread placements 1 (1/0), 2 (2/0), 2 (1/1), 3 (2/1), 4 (2/2), where (a/b) = #threads on socket 1 / socket 2]

55 Case study: Sparse matrix-vector multiplication
[Figure: s3dkt3m2 sparse MVM benchmark on node130; MFlops/s for CRS and JDS at thread placements 1 (1/0), 2 (2/0), 2 (1/1), 3 (2/1), 4 (2/2), where (a/b) = #threads on socket 1 / socket 2]

56 Case study: Sparse matrix-vector multiplication
[Figure: fidapm37 sparse MVM benchmark on klio; MFlops/s for CRS and JDS at thread placements 1 (1/0), 2 (2/0), 2 (1/1), 3 (2/1), 4 (2/2), where (a/b) = #threads on socket 1 / socket 2]

57 Case study: Sparse matrix-vector multiplication
[Figure: fidapm37 sparse MVM benchmark on node130; MFlops/s for CRS and JDS at thread placements 1 (1/0), 2 (2/0), 2 (1/1), 3 (2/1), 4 (2/2), where (a/b) = #threads on socket 1 / socket 2]
