Introduction to High Performance Computing and Optimization


1 Institut für Numerische Mathematik und Optimierung
Introduction to High Performance Computing and Optimization
Oliver Ernst
Audience: 1./3. CMS, 5./7./9. Mm, doctoral students
Wintersemester 2012/13

2 Contents
1. Introduction
2. Processor Architecture
3. Optimization of Serial Code
   3.1 Performance Measurement
   3.2 Optimization Guidelines
   3.3 Compiler-Aided Optimization
   3.4 Combine example
   3.5 Further Optimization Issues
4. Parallel Computing
   4.1 Introduction
   4.2 Scalability
   4.3 Parallel Architectures
   4.4 Networks
5. OpenMP Programming

3 Contents
1. Introduction
2. Processor Architecture
3. Optimization of Serial Code
4. Parallel Computing
5. OpenMP Programming

4 Background
Application Programming Interface (API) for shared-memory programming; supports C/C++ and Fortran on all architectures; implemented by many compilers; managed by the OpenMP Architecture Review Board (ARB).
Initiated by vendors of SMP systems in the late 1990s to provide portable language extensions for programming SMP architectures.
Introduced 1997; the current specification is Version 3.1 (July 2011); Version 4.0 exists as a release candidate (November 2012). openmp.org
Feature set: the OpenMP API comprises a set of compiler directives, runtime library routines and environment variables. These allow programmers to create teams of threads for parallel execution, specify how to share work among the members of a team, declare both shared and private variables, and synchronize threads and enable them to perform certain operations exclusively (i.e., without interference by other threads).

5 Literature
R. Chandra et al. Parallel Programming in OpenMP. Morgan Kaufmann, 2000.
B. Chapman, G. Jost and R. van der Pas. Using OpenMP. MIT Press, 2007.
OpenMP standard specification (currently 3.1), available from openmp.org.
Manual of the GNU OpenMP implementation libgomp.

6 Basic Model
Fork/join model: an initial thread (serial region) forks a team of threads for (one or more) parallel regions; at the join, execution continues with the initial thread (serial region).
Work is shared among cooperating threads.
Shared and private memory.
OpenMP handles many basic system tasks (as opposed to, say, POSIX threads or Intel's Threading Building Blocks, TBB).
Incremental parallelization of given serial code is possible.
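
To make the fork/join model concrete, here is a minimal sketch (not from the slides; compile e.g. with gcc -fopenmp):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("serial region: initial thread only\n");
#pragma omp parallel   /* fork: a team of threads executes this block */
    {
        printf("parallel region: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                  /* join: implicit barrier, back to the initial thread */
    printf("serial region: initial thread only\n");
    return 0;
}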

7 Processes and threads
Process: an operating-system term (the older of the two); essentially the code of a program combined with state information.
A process can consist of multiple threads; these share a common address space. Threads have their own program counter (instruction pointer) and stack.
Threads are lightweight processes in the sense that they can be started (spawned) with much less overhead than starting (forking) a new process. Thread communication (via the shared memory) is cheaper than inter-process communication.
In OpenMP: each thread has a unique thread ID, which can be queried by the API function omp_get_thread_num(). The number of parallel threads is set by an environment variable:

export OMP_NUM_THREADS=4
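
A run-time usage sketch (assuming the example after slide 6 was compiled with gcc -fopenmp -o hello hello.c; the binary name is my choice):

export OMP_NUM_THREADS=4
./hello     # the parallel region prints four lines, threads 0..3

export OMP_NUM_THREADS=2
./hello     # now only threads 0 and 1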

8 Data Scoping
By default: variables existing before a parallel region still exist inside it and are shared by all threads.
Three ways to make variables private:
(1) Apply the private clause to an existing variable, creating a separate instance of this variable for each thread.
(2) The index variable of a work-sharing loop.
(3) Local variables in a subroutine called from a parallel region are private to each thread. Note: local variables with C storage class static remain shared.
In C: the private property applies to the entire structured block and thus also to variables declared there:

#pragma omp parallel
{
    int bstart, bend, blen, numth, tid, i;
    ...                       /* calculate loop boundaries */
    for (i = bstart; i <= bend; i++)
        a[i] = b[i] + c[i];
}
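
A small sketch of way (1), the private clause (my example, not from the slides):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int tid = -1;                    /* exists before the parallel region */
#pragma omp parallel private(tid)    /* each thread gets its own instance */
    {
        tid = omp_get_thread_num();  /* private copy; uninitialized on entry */
        printf("hello from thread %d\n", tid);
    }
    /* the private copies end with the region; use firstprivate/lastprivate
       to copy values into/out of a construct */
    return 0;
}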

9 Directives
Include the header omp.h for function prototypes.
#pragma directives control the actions of the compiler without affecting the program as a whole. OpenMP directives in C/C++ are inserted via #pragma omp <directive [clause]>.
Conditional compilation: activating the compiler's OpenMP switch sets a preprocessor macro:

#ifdef _OPENMP
... do something
#endif

The directive

#pragma omp parallel
<structured block>

makes the structured block a parallel region:
All code executed between start and end of this region is executed by all threads. This includes subroutine calls within the region (unless explicitly sequentialized).
Local variables inside the block are automatically private to each thread.

10 The loop construct

#pragma omp for
<for loop>

Causes the iterations of the loop immediately following to be executed in parallel. Restricted to loops for which the number of iterations is known a priori.

#pragma omp for
for (i = 0; i < n; i++)
    printf("Thread %d executes loop iteration %d\n", omp_get_thread_num(), i);

Example of a parallel for loop. Note: this only executes in parallel if enclosed in a parallel region. Otherwise use the combined directive:

#pragma omp parallel for
for (i = 0; i < n; i++)
    printf("Thread %d executes loop iteration %d\n", omp_get_thread_num(), i);

11 The loop construct
Example output for n = 9 and 4 threads:

Thread 0 executes loop iteration 0
Thread 0 executes loop iteration 1
Thread 0 executes loop iteration 2
Thread 3 executes loop iteration 7
Thread 3 executes loop iteration 8
Thread 2 executes loop iteration 5
Thread 2 executes loop iteration 6
Thread 1 executes loop iteration 3
Thread 1 executes loop iteration 4

Note: here no assignment of iterations to threads was specified; in this case the decision is left to the implementation.

12 Example Quadrature

$$\pi = \int_0^1 \frac{4}{1+x^2}\,dx \;\approx\; \frac{1}{n}\sum_{j=1}^{n} f(x_j), \qquad x_j = \frac{j-\tfrac12}{n}, \qquad f(x) = \frac{4}{1+x^2}.$$

[Figure: plot of f(x) = 4/(1+x^2) on [0,1]]

13 Example Quadrature

double rect(int N)
{
    double h, I = 0.0, x;
    int j;

    h = 1.0 / N;
    for (j = 0; j < N; j++) {
        x = (j + 0.5) * h;
        I += 4.0 / (1.0 + x * x);
    }
    return I * h;
}

C code for approximating π by quadrature.
Each loop iteration is independent. Divide this work across several threads.

14 Example Quadrature

#include <omp.h>

double rect(int N)
{
    double h, I = 0.0, x, sum;
    int j;

    h = 1.0 / N;
#pragma omp parallel private(x, sum)
    {
        sum = 0.0;
#pragma omp for
        for (j = 0; j < N; j++) {
            x = (j + 0.5) * h;
            sum += 4.0 / (1.0 + x * x);
        }
#pragma omp critical
        I = I + h * sum;
    }
    return I;
}

C code for approximating π by quadrature with OpenMP directives.

15 Clauses
Some OpenMP directives also support clauses. In the quadrature example, private(x, sum) appears as a clause to the parallel directive; it declares the variables x and sum to be private to each thread.
Clauses supported by the loop construct (a sketch of firstprivate/lastprivate follows below):
private(<list>)
firstprivate(<list>): pre-initialize the private variables with the value of the shared variable of the same name from before the parallel region.
lastprivate(<list>): make the value of the private variable from the (sequentially) last iteration available after the parallel region.
reduction(<operator>:<list>): applies a reduction operator to the variables after the parallel region.
ordered: force execution in loop order.
schedule(<kind>[,<chunk-size>]): allows control over the assignment of iterations to threads.
nowait: omit the implied barrier at the end of the worksharing construct.
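
A sketch of firstprivate and lastprivate (my example, not from the slides):

#include <stdio.h>

int main(void)
{
    int i, x = 10, last = -1;
#pragma omp parallel for firstprivate(x) lastprivate(last)
    for (i = 0; i < 100; i++)
        last = x + i;        /* x is pre-initialized to 10 in every thread */
    /* lastprivate: after the loop, last holds the value from the
       sequentially final iteration (i == 99), hence 109 */
    printf("last = %d\n", last);
    return 0;
}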

16 Synchronization: critical construct
When several threads write to the same shared variable, a race condition results if no serialization is imposed. OpenMP supplies the latter with the critical construct.

#pragma omp critical
<structured block>

The structured block is declared a critical section of code, which may be entered by only one thread at a time.
Badly arranged critical regions can lead to deadlocks: threads waiting for events which never occur. Reason: when a thread encounters a critical directive inside a critical region it will block forever.
Solution: names for critical directives. Only critical sections with the same name exclude each other.

#pragma omp critical(<name>)
<structured block>

18 Synchronization: critical construct
Example:

#pragma omp parallel for private(x)
for (i = 1; i <= N; i++) {
    x = sin(2 * M_PI * i / N);
#pragma omp critical(psum)
    sum += func(x);
}

double func(double v)
{
    double r;
#pragma omp critical(prand)
    r = v + random_func();   /* random_func keeps hidden state of its own */
    return r;
}

19 Synchronization: barrier construct

#pragma omp barrier

Causes a thread to wait until all other threads of the team have reached this point in the code.
Applies to the innermost enclosing parallel region.
Many OpenMP constructs contain an implied barrier (added by the compiler).
Potential for deadlocks if not all threads reach this statement: the sequence of worksharing regions and barrier regions must be the same for all threads.
Most often used to avoid a data race condition: a barrier between a write and a subsequent read from a shared variable (a sketch follows below).
Note: use only where necessary, as barriers cause synchronization overhead.
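
A minimal sketch (not from the slides) of the write-then-read pattern mentioned above:

#include <stdio.h>
#include <omp.h>

double shared_val;               /* shared by all threads */

int main(void)
{
#pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
            shared_val = 42.0;   /* one thread writes ...            */
#pragma omp barrier              /* ... all threads wait here ...    */
        printf("thread %d reads %g\n",   /* ... then everyone reads  */
               omp_get_thread_num(), shared_val);
    }
    return 0;
}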

20 Reduction clause

#pragma omp parallel for reduction(+:sum)
for (i = 0; i < n; i++)
    sum += a[i];

Associative and commutative mathematical operations like + and * are used to accumulate private variables into one shared variable.
The programmer specifies the operation and the variable which holds the result; this variable is shared after exiting the parallel region.
Usually faster than a manual implementation (see the rewrite of the quadrature example below).
Automatic initialization of the private reduction variables with appropriate values (e.g. 0 for +, 1 for *). (Note: danger of producing incorrect serial code!)
The set of supported operations differs slightly between C/C++ and Fortran.
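
For comparison, a hedged rewrite (mine, not from the slides) of the quadrature routine from slide 14 using a reduction instead of the manual critical section:

#include <omp.h>

double rect(int N)
{
    double h = 1.0 / N, x, sum = 0.0;
    int j;

#pragma omp parallel for private(x) reduction(+:sum)
    for (j = 0; j < N; j++) {
        x = (j + 0.5) * h;
        sum += 4.0 / (1.0 + x * x);
    }
    return h * sum;   /* approximates pi */
}

The per-thread partial sums and the critical section disappear; the runtime accumulates the private copies of sum at the end of the loop.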

21 Schedule clause

#pragma omp parallel for schedule(static)
for (i = 0; i < n; i++)
    a[i] = calculate(i);

Used to assign loop iterations to threads.
Syntax: schedule(<kind>[,<chunk-size>])
Granularity of workload distribution is determined by the expression chunk-size. Chunk: contiguous, nonempty subrange of the iteration index.
4 kinds:
static: chunks assigned cyclically in order of thread ID.
dynamic: chunks assigned as threads request them; default chunk-size is 1.
guided: as in dynamic, but chunk-size proportional to the remaining number of iterations divided by the number of threads.
runtime: scheduling determined at run time as set in the OMP_SCHEDULE environment variable.

22 Schedule clause
Example: different assignments of 20 iterations of a loop to 3 threads (T0, T1, T2) using the schedule clause.
[Figure: iteration-to-thread maps for STATIC, STATIC,3, DYNAMIC[,1], DYNAMIC,3 and GUIDED[,1]; source: Hager & Wellein]

23 Single construct

#pragma omp single
<structured block>

Specifies that the structured block immediately following should be executed by one thread only.
Which thread is not specified and will vary across runs (and across single constructs).

24 Single construct (example)

#pragma omp parallel shared(a, b) private(i)
{
#pragma omp single
    {
        a = 10;
        printf("Single construct executed by thread %d\n", omp_get_thread_num());
    } /* barrier automatically inserted here */
#pragma omp for
    for (i = 0; i < n; i++)
        b[i] = a;
}

25 Task construct

#pragma omp task [<clause>[[,] <clause>]...]
<structured block>

Added with OpenMP standard 3.0 to parallelize collections of tasks not easily expressed as a loop over an index set.
Task construct: task directive plus structured block.
In tasking terms, a thread encountering a parallel construct packages up a set of implicit tasks, one per thread; a team of threads is created, and each thread in the team is assigned one of the tasks. An implicit barrier holds the master thread until all implicit tasks are completed.

26 Task construct: example

#pragma omp parallel
{
#pragma omp single private(p)
    {
        p = listhead;
        while (p) {
#pragma omp task
            process(p);   /* p is firstprivate by default */
            p = next(p);
        }
    }
}

27 Conditional compilation
Issue: different code depending on whether OpenMP is active or not.
OpenMP directives are automatically ignored by compilers without OpenMP support.
Calls to OpenMP API functions can be masked out with the C/C++ preprocessor symbol _OPENMP.

28 Memory consistency

myid = 0;
numthreads = 1;
#ifdef _OPENMP
#pragma omp parallel private(myid)
{
    myid = omp_get_thread_num();
#pragma omp single
    numthreads = omp_get_num_threads();
#pragma omp critical
    printf("Parallel program, this is thread %d of %d.\n", myid, numthreads);
}
#else
printf("Serial program.\n");
#endif

The call to omp_get_num_threads() is placed in a single region since numthreads is shared.
For correct values to be printed, all threads must wait until the thread executing the single region has completed (the implicit barrier after single ensures this).
Variables held in registers are written out so that cache coherency can guarantee memory consistency (cf. the OpenMP flush directive).

29 Thread safety
The printf statement above is serialized (protected by a critical region) to prevent the output streams of multiple threads from interfering.
General rule: I/O operations, general OS functionality, and common library functions should be serialized because they may not be thread safe.
Prominent example: the rand() function from the C library, as it uses a static variable to store its hidden state (the seed).
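
A hedged sketch (not from the slides) of serializing such calls; alternatively, a reentrant variant such as POSIX rand_r() avoids the lock entirely:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int i, r;
#pragma omp parallel for private(r)
    for (i = 0; i < 8; i++) {
#pragma omp critical(rng)     /* rand() keeps hidden static state: serialize */
        r = rand();
#pragma omp critical(io)      /* serialize the shared output stream as well */
        printf("iteration %d drew %d\n", i, r);
    }
    return 0;
}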

30 Affinity
The OpenMP standard provides no way to bind threads to cores.
No provisions for locality constraints.
Can't rely on the OS to do this well.
Remedy: OS-level tools such as likwid.
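
For example (a sketch; exact options depend on the installed likwid version), likwid-pin can pin the threads of an OpenMP binary to cores:

likwid-pin -c 0-3 ./a.out    # run a.out with its 4 threads pinned to cores 0-3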

31 Case study: Jacobi relaxation
One of the oldest iterative methods for solving a linear system of equations $Ax = b$, $A \in \mathbb{R}^{N \times N}$, $b \in \mathbb{R}^N$, is Jacobi's method: given an approximation $x_m \approx A^{-1}b$, a new approximation $x_{m+1}$ is obtained by updating each component of $x_m$ by assuming all remaining components are correct: for the $i$-th equation

$$a_{i,1}[x_m]_1 + a_{i,2}[x_m]_2 + \cdots + a_{i,N}[x_m]_N = [b]_i, \qquad 1 \le i \le N,$$

one obtains

$$[x_{m+1}]_i = \Bigl([b]_i - \sum_{\substack{j=1\\ j\ne i}}^{N} a_{i,j}[x_m]_j\Bigr)\Big/ a_{i,i}.$$

In matrix notation, setting $D := \operatorname{diag}(A)$, the complete update reads

$$x_{m+1} = x_m + D^{-1}(b - Ax_m).$$

32 Case study: Jacobi relaxation
When used to solve linear systems arising from the discretization of PDEs, such iterations are called relaxation methods.
The 5-point FD discretization of $-\Delta u = f$ on $\Omega = (0,1)^2$ with $u|_{\partial\Omega} = 0$, on a uniform grid $(x_i, y_j)$ of mesh width $h = 1/(n+1)$, leads to a system $Au = f$ with $f_{i,j} = f(x_i, y_j)$, $u_{i,j} \approx u(x_i, y_j)$ and

$$[Au]_{i,j} = \frac{4u_{i,j} - u_{i-1,j} - u_{i+1,j} - u_{i,j-1} - u_{i,j+1}}{h^2}.$$

[Figure: uniform grid with 5-point stencil; axes x, y, mesh width h]

33 Case study: Jacobi relaxation
One step of Jacobi relaxation thus updates each component $[u_m]_{i,j}$ of $u_m$ as

$$[u_{m+1}]_{i,j} = \frac{h^2 [f]_{i,j} + [u_m]_{i-1,j} + [u_m]_{i+1,j} + [u_m]_{i,j-1} + [u_m]_{i,j+1}}{4}.$$

In MATLAB:

G = numgrid('S', n+2); h = 1/(n+1);
A = 1/h^2 * delsq(G); f = ones(n^2, 1);
u_ex = A\f;

u = zeros(size(f)); r = f - A*u;
D = spdiags(diag(A), 0, n^2, n^2);
for j = 1:100
    u = u + D\r;
    r = f - A*u;
    errnorm(j) = norm(u_ex - u)/norm(u_ex);
end

34 Case study: Jacobi relaxation
Not the fastest linear solver
[Figure: relative error vs. iteration m for N=10, N=20, ...]
but still a common computational pattern (stencil-based computation, regular mesh).

35 Case study: Jacobi relaxation

while (maxdelta > eps && it < 5) {
    maxdelta = 0.; it++;
#pragma omp parallel private(i, k, localmax, tmp)
    {
        localmax = 0.;   /* private copy must be initialized inside the region */
#pragma omp for
        for (i = 1; i <= N; i++) {
            for (k = 1; k <= N; k++) {
                /* four flops, one store, four loads */
                phi[t1][i][k] = (phi[t0][i+1][k] + phi[t0][i-1][k] +
                                 phi[t0][i][k+1] + phi[t0][i][k-1]) * 0.25;
                if (phi[t1][i][k] < phi[t0][i][k])
                    tmp = phi[t0][i][k] - phi[t1][i][k];
                else
                    tmp = phi[t1][i][k] - phi[t0][i][k];
                if (localmax < tmp)
                    localmax = tmp;
                /* localmax = fmax(localmax, fabs(phi[t1][i][k] - phi[t0][i][k])); */
            }
        }
#pragma omp flush(maxdelta)
        if (localmax > maxdelta) {
#pragma omp critical
            if (localmax > maxdelta)
                maxdelta = localmax;
        }
    }
    i = t0; t0 = t1; t1 = i;   /* t0 ^= t1; t1 ^= t0; t0 ^= t1;  swap arrays */
}

36 Case study: Jacobi relaxation
Hager and Wellein's timings:
[Figure: performance in MLUPs/sec vs. N for 1 thread; 2 threads, 1 socket; 2 threads, 2 sockets; and 4 threads. Inset: dual-socket node with 4 cores (32k L1D each), two shared 4 MB L2 caches (8 MB total), chipset, memory]

37 Case study: Jacobi relaxation
Our timings:
[Figure: 2D Jacobi OpenMP on klio; MLUP/s vs. N for 1 thread; 2 threads, 1 socket; 2 threads, 2 sockets; 3 threads; and 4 threads]

38 Case study: Sparse matrix-vector multiplication
Many scientific computing problems (PDEs, optimization, graph analysis) feature matrices which are sparse, i.e., contain only a small number of nonzero entries in each row/column.
Exploiting this fact requires storing these matrices in data structures which hold (essentially) only the nonzero entries, and developing algorithms for matrix operations on these data structures.
The resulting sparse matrix formats dramatically reduce operation and storage complexity for operations with sparse matrices.
Different formats have been developed reflecting specific application properties and computer hardware behavior. We compare two with regard to OpenMP parallelization on current SMP systems.

40 Case study: Sparse matrix-vector multiplication
In CRS (compressed row storage), also known as CSR (compressed sparse row):
a (linear) array val of floating point numbers stores the nonzero matrix entries row-wise;
a corresponding integer array col_idx contains the column indices of the nonzero matrix entries;
a second integer array row_ptr contains, for each row, the index in val and col_idx of this row's first entry.
A possible C container for this format is sketched below.
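
A minimal sketch of such a container (the array names follow the slides; the struct name and fields n, nnz are my additions):

/* A minimal CRS container; n = number of rows, nnz = number of nonzeros. */
typedef struct {
    int     n;        /* number of rows                      */
    int     nnz;      /* number of stored (nonzero) entries  */
    double *val;      /* nonzero values, row by row   [nnz]  */
    int    *col_idx;  /* column index of each value   [nnz]  */
    int    *row_ptr;  /* start of each row in val     [n+1]  */
} crs_matrix;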

41 Case study: Sparse matrix-vector multiplication
Example:
[Slide shows a small sparse matrix A together with its CRS representation, i.e., the contents of the arrays val, row_ptr and col_idx]
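
Since the slide's matrix did not survive extraction, here is a substitute worked example (my own, not the original): for

A = [ 1 0 2
      0 3 0
      4 0 5 ]

with 1-based indexing, the CRS arrays are

val     = [ 1 2 3 4 5 ]
col_idx = [ 1 3 2 1 3 ]
row_ptr = [ 1 3 4 6 ]

so row i occupies positions row_ptr[i] .. row_ptr[i+1]-1 of val and col_idx, matching the index convention of the kernel on the next slide.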

42 Case study: Sparse matrix-vector multiplication
CRS sparse matrix-vector product y ← Ax (n denotes the number of rows):

for (i = 1; i <= n; i++)
    for (j = row_ptr[i-1]; j < row_ptr[i]; j++)
        y[i-1] += val[j-1] * x[col_idx[j-1] - 1];

Long outer loop (over the number of rows n).
Short inner loop (over the nonzeros in each row) relative to pipeline length.
y accessed sequentially, one load per entry/cache line. Same holds for val.
Indirect addressing of x; only a problem if the nonzeros are not clustered around the diagonal. (This can be achieved by applying suitable reordering techniques to the matrix initially.)
Favorable ratio of data movement to arithmetic.

43 Case study: Sparse matrix-vector multiplication
Another popular sparse matrix format is JDS (jagged diagonal storage), which is constructed as follows:
(1) All zero entries are eliminated.
(2) The remaining (nonzero) entries in each row are shifted to the left.
(3) The rows are reordered according to decreasing length.
(4) An integer array perm records the associated permutation.
(5) The resulting (dense) columns are arranged consecutively in a linear array val. (These are the jagged diagonals, as they traverse the matrix in a diagonal-like fashion.)
(6) For each nonzero entry in val, the (original) column index is stored in an integer array col_idx.
(7) The original column indices within each jagged diagonal are also permuted according to the permutation represented in perm. This allows the same ordering of the input and result vectors.
(8) An integer array jd_ptr records the beginning of each jagged diagonal.

44 Case study: Sparse matrix-vector multiplication
[Figure: JDS representation of the example matrix, showing the arrays perm, val, the original column indices, col_idx and jd_ptr]

45 Case study: Sparse matrix-vector multiplication
JDS sparse matrix-vector product y ← Ax (nd denotes the number of jagged diagonals):

for (d = 1; d <= nd; d++) {
    diaglen = jd_ptr[d] - jd_ptr[d-1];
    offset  = jd_ptr[d-1] - 1;
    for (i = 1; i <= diaglen; i++)
        y[i-1] += val[offset+i-1] * x[col_idx[offset+i-1] - 1];
}

Long inner loop (over the length of a diagonal), better for pipelining. Note: no dependencies in the inner loop.
Short outer loop (over the number of jagged diagonals).
Multiple loads of the result vector y.
Sequential access of val.
Indirect accessing of the input vector x.
Favorable access pattern if the diagonals are reasonably straight.

46 Case study: Sparse matrix-vector multiplication
We optimize JDS sparse matrix-vector multiplication by loop unrolling and loop fusion (sometimes called unroll-and-jam).
Problem: unrolling and fusion require the inner loop length to be independent of the outer loop index, but jagged diagonals may have different lengths.
Solution: loop peeling. For m-way unrolling: cut chunks of m diagonals to a uniform length, leaving up to m-1 partial diagonals for separate treatment.

49 Case study: Sparse matrix-vector multiplication
JDS sparse matrix-vector product y ← Ax with 2-way unroll and jam:

for (d = 1; d <= nd; d += 2) {
    diaglen = min(jd_ptr[d] - jd_ptr[d-1], jd_ptr[d+1] - jd_ptr[d]);
    offset1 = jd_ptr[d-1] - 1;
    offset2 = jd_ptr[d]   - 1;
    for (i = 1; i <= diaglen; i++) {
        y[i-1] += val[offset1+i-1] * x[col_idx[offset1+i-1] - 1];
        y[i-1] += val[offset2+i-1] * x[col_idx[offset2+i-1] - 1];
    }
    /* peeled-off iterations of the longer diagonal d */
    for (i = diaglen + 1; i <= jd_ptr[d] - jd_ptr[d-1]; i++)
        y[i-1] += val[offset1+i-1] * x[col_idx[offset1+i-1] - 1];
}

50 Case study: Sparse matrix-vector multiplication
For large m the number of registers becomes a bottleneck. Solution: add blocking.
JDS sparse matrix-vector product y ← Ax with loop blocking (block size b):

/* loop over blocks */
for (ib = 1; ib <= n; ib += b) {
    block_start = ib;
    block_end   = min(ib + b - 1, n);
    /* loop over diagonals in one block */
    for (d = 1; d <= nd; d++) {
        diaglen = jd_ptr[d] - jd_ptr[d-1];
        offset  = jd_ptr[d-1] - 1;
        if (diaglen >= block_start)
            /* standard JDS mv kernel */
            for (i = block_start; i <= min(block_end, diaglen); i++)
                y[i-1] += val[offset+i-1] * x[col_idx[offset+i-1] - 1];
    }
}

51 Case study: Sparse matrix-vector multiplication
The parallelization of sparse matrix-vector multiplication in CRS format is trivial:

#pragma omp parallel for private(j)
for (i = 1; i <= n; i++)
    for (j = row_ptr[i-1]; j < row_ptr[i]; j++)
        y[i-1] += val[j-1] * x[col_idx[j-1] - 1];

OpenMP overhead is amortized over the long outer loop.
Possible load imbalance if longer rows occur in a clustered arrangement. In this case a suitable choice of schedule options (dynamic, guided) is needed.

52 Case study: Sparse matrix-vector multiplication
The simple JDS algorithm is also easily parallelized by sharing the inner loop:

#pragma omp parallel private(d, diaglen, offset)
for (d = 1; d <= nd; d++) {
    diaglen = jd_ptr[d] - jd_ptr[d-1];
    offset  = jd_ptr[d-1] - 1;
#pragma omp for
    for (i = 1; i <= diaglen; i++)
        y[i-1] += val[offset+i-1] * x[col_idx[offset+i-1] - 1];
}

OpenMP overhead is amortized over the long inner loop.
No load imbalance since all inner loop iterations contain the same amount of work.
The only problem is the bad serial performance of JDS matrix-vector multiplication.

53 Case study: Sparse matrix-vector multiplication
The blocked JDS algorithm is parallelized by sharing the outer loop over blocks:

#pragma omp parallel for private(block_start, block_end, i, d, \
                                 diaglen, offset)
for (ib = 1; ib <= n; ib += b) {
    block_start = ib;
    block_end   = min(ib + b - 1, n);
    for (d = 1; d <= nd; d++) {
        diaglen = jd_ptr[d] - jd_ptr[d-1];
        offset  = jd_ptr[d-1] - 1;
        if (diaglen >= block_start)
            for (i = block_start; i <= min(block_end, diaglen); i++)
                y[i-1] += val[offset+i-1] * x[col_idx[offset+i-1] - 1];
    }
}

Even less overhead since the parallel for directive surrounds the outer loop.
More potential for load imbalance as the matrix rows are sorted by size; static scheduling appropriate.

54 Case study: Sparse matrix-vector multiplication
[Figure: s3dkt3m2 sparse MVM benchmark on klio; MFlops/s for CRS and JDS at thread placements 1 (1/0), 2 (2/0), 2 (1/1), 3 (2/1), 4 (2/2), where (a/b) = #threads on socket 1 / socket 2]

55 Case study: Sparse matrix-vector multiplication
[Figure: s3dkt3m2 sparse MVM benchmark on node130; MFlops/s for CRS and JDS at thread placements 1 (1/0), 2 (2/0), 2 (1/1), 3 (2/1), 4 (2/2), where (a/b) = #threads on socket 1 / socket 2]

56 Case study: Sparse matrix-vector multiplication
[Figure: fidapm37 sparse MVM benchmark on klio; MFlops/s for CRS and JDS at thread placements 1 (1/0), 2 (2/0), 2 (1/1), 3 (2/1), 4 (2/2), where (a/b) = #threads on socket 1 / socket 2]

57 Case study: Sparse matrix-vector multiplication
[Figure: fidapm37 sparse MVM benchmark on node130; MFlops/s for CRS and JDS at thread placements 1 (1/0), 2 (2/0), 2 (1/1), 3 (2/1), 4 (2/2), where (a/b) = #threads on socket 1 / socket 2]
