Synchronous Shared Memory Parallel Examples
HPC Fall 2010
Prof. Robert van Engelen

Examples

- Data parallel prefix sum and OpenMP example
- Task parallel prefix sum and OpenMP example
- Simple heat distribution problem with OpenMP
- Iterative solver with OpenMP
- Simple heat distribution problem with HPF
- Iterative solver with HPF
- Gaussian elimination with HPF

Dataparallel Prefix Sum

Serial:

    sum[0] = x[0];
    for (i = 1; i < n; i++)
      sum[i] = sum[i-1] + x[i];

Dataparallel:

    for (j = 0; j < log2(n); j++)
      forall (i = 2^j; i < n; i++)
        x[i] = x[i] + x[i - 2^j];

Dataparallel = synchronous (lock-step). Time complexity? Parallel efficiency and speedup?

The dataparallel forall permits concurrent writes and reads, but a read always fetches the old value: forall has copy-in-copy-out semantics (cf. the CRCW PRAM model).
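One possible answer to the slide's questions (not on the slide): with n processors the dataparallel version runs ceil(log2 n) lock-step rounds, each O(1), so the parallel time is O(log n). The total work is sum_{j=0}^{log2(n)-1} (n - 2^j) = n log2 n - n + 1 = O(n log n), versus O(n) additions for the serial scan, giving speedup O(n / log n) and efficiency O(1 / log n).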

OpenMP Prefix Sum v1

    for (j = 0; j < log2(n); j++)
    {
      #pragma omp parallel private(i)
      {
        #pragma omp for
        for (i = 1<<j; i < n; i++)
          t[i] = x[i] + x[i - (1<<j)];
        #pragma omp for  // copy out
        for (i = 1<<j; i < n; i++)
          x[i] = t[i];
      }
    }

What about overhead? Note: use a bitshift to compute 2^j = 1<<j, and mind the parentheses in x[i - (1<<j)]: in C, i - 1<<j parses as (i - 1)<<j.
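For experimentation, here is a self-contained version of v1; the array size N, the all-ones test input, and the verification printout are assumptions added for the demo:

    #include <stdio.h>

    #define N 16   /* assumed problem size for the demo */

    int main(void) {
        int x[N], t[N], i, j;
        for (i = 0; i < N; i++) x[i] = 1;   /* prefix sums should be 1,2,...,N */

        /* log2(N) doubling steps; the parentheses around (1<<j) matter */
        for (j = 0; (1 << j) < N; j++) {
            #pragma omp parallel private(i)
            {
                #pragma omp for
                for (i = (1 << j); i < N; i++)
                    t[i] = x[i] + x[i - (1 << j)];
                #pragma omp for   /* copy out; the implicit barrier separates the phases */
                for (i = (1 << j); i < N; i++)
                    x[i] = t[i];
            }
        }

        for (i = 0; i < N; i++)
            printf("%d ", x[i]);            /* expect: 1 2 3 ... N */
        printf("\n");
        return 0;
    }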

Task Parallel Prefix Sum

    for each processor 0 < p < n
      private j
      for (j = 1; j < n; j = 2*j)
      {
        if (p >= j)
          x[p] = x[p] + x[p-j];
        barrier
      }

Task/thread-parallel: it is best to parallelize the outer loops. (Strictly, each step must read x[p-j] before processor p-j overwrites it; the OpenMP v1 version avoids this with the temporary array t.)

OpenMP Prefix Sum v2

    #pragma omp parallel shared(n,x) private(j,tid) num_threads(n)
    {
      tid = omp_get_thread_num();
      for (j = 1; j < n; j = 2*j)
      {
        if (tid >= j)
          x[tid] = x[tid] + x[tid - j];
        #pragma omp barrier
      }
    }

Uses n threads! But what if n is large?

OpenMP Prefix Sum v3

    #pragma omp parallel shared(n,nthr,x,z) private(i,j,tid,work,lo,hi)
    {
      #pragma omp single
      nthr = omp_get_num_threads();  // note: assumes nthreads = 2^k

      tid = omp_get_thread_num();
      work = (n + nthr-1) / nthr;
      lo = work * tid;
      hi = lo + work;
      if (hi > n) hi = n;

      for (i = lo+1; i < hi; i++)    // local prefix sum over x
        x[i] = x[i] + x[i-1];

      z[tid] = x[hi-1];              // z[tid] = this thread's local total x[hi-1]
      #pragma omp barrier

      for (j = 1; j < nthr; j = 2*j) // global prefix sum over z
      {
        if (tid >= j)
          z[tid] = z[tid] + z[tid - j];
        #pragma omp barrier
      }

      for (i = lo; i < hi; i++)      // update local prefix sum x
        x[i] = x[i] + z[tid] - x[hi-1];
    }
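Below is a compilable sketch of v3 with a driver added; the problem size, the fixed-size z array (one slot per thread), and the result check are assumptions, and it presumes n is much larger than the thread count so every block is non-empty. The global scan here reads z[tid-j] into a temporary before a barrier, since updating z in place while a neighbor still reads it can race:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void) {
        int n = 1 << 20;                  /* assumed problem size, n >> nthreads */
        int *x = malloc(n * sizeof(int));
        int z[256];                       /* assumed upper bound on thread count */
        int nthr;
        for (int i = 0; i < n; i++) x[i] = 1;

        #pragma omp parallel shared(n, nthr, x, z)
        {
            int tid = omp_get_thread_num();
            #pragma omp single
            nthr = omp_get_num_threads();       /* implicit barrier publishes nthr */

            int work = (n + nthr - 1) / nthr;   /* block partition as on the slide */
            int lo = work * tid;
            int hi = lo + work;
            if (hi > n) hi = n;

            for (int i = lo + 1; i < hi; i++)   /* local prefix sum over this block */
                x[i] += x[i-1];
            z[tid] = x[hi-1];                   /* local total */
            #pragma omp barrier

            for (int j = 1; j < nthr; j *= 2) { /* global prefix sum over z */
                int tmp = 0;
                if (tid >= j) tmp = z[tid - j]; /* read before anyone overwrites it */
                #pragma omp barrier
                if (tid >= j) z[tid] += tmp;
                #pragma omp barrier
            }

            for (int i = lo; i < hi; i++)       /* add the totals of preceding blocks */
                x[i] += z[tid] - x[hi-1];
        }

        printf("x[n-1] = %d (expect %d)\n", x[n-1], n);
        free(x);
        return 0;
    }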

Dataparallel Heat Distribution Problem

    for (iter = 0; iter < limit; iter++)
      forall (i = 1; i < n-1; i++)     // update interior points only
        forall (j = 1; j < n-1; j++)
          h[i][j] = 0.25*(h[i-1][j]+h[i+1][j]+h[i][j-1]+h[i][j+1]);

Dataparallel = synchronous. Corresponds to Jacobi iteration.

OpenMP Heat Distribution Problem

    #pragma omp parallel shared(newh,h,limit,n) private(iter,i,j)
    for (iter = 0; iter < limit; iter++)
    {
      #pragma omp for
      for (i = 1; i < n-1; i++)
        for (j = 1; j < n-1; j++)
          newh[i][j] = 0.25*(h[i-1][j]+h[i+1][j]+h[i][j-1]+h[i][j+1]);
      #pragma omp for
      for (i = 1; i < n-1; i++)
        for (j = 1; j < n-1; j++)
          h[i][j] = newh[i][j];
    }

Corresponds to Jacobi iteration.
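A self-contained variant for trying this out; the grid size, the boundary condition (top edge held at 100), and the iteration count are assumptions:

    #include <stdio.h>

    #define N 64          /* assumed grid size */
    #define LIMIT 1000    /* assumed iteration count */

    static double h[N][N], newh[N][N];   /* file-scope arrays are shared by default */

    int main(void) {
        /* assumed boundary condition: top edge at 100.0, other edges at 0 */
        for (int j = 0; j < N; j++) h[0][j] = newh[0][j] = 100.0;

        #pragma omp parallel
        for (int iter = 0; iter < LIMIT; iter++) {
            #pragma omp for
            for (int i = 1; i < N-1; i++)          /* interior points only */
                for (int j = 1; j < N-1; j++)
                    newh[i][j] = 0.25*(h[i-1][j] + h[i+1][j] + h[i][j-1] + h[i][j+1]);
            #pragma omp for                         /* implicit barriers keep the */
            for (int i = 1; i < N-1; i++)           /* two phases in lock-step    */
                for (int j = 1; j < N-1; j++)
                    h[i][j] = newh[i][j];
        }

        printf("center value: %f\n", h[N/2][N/2]);
        return 0;
    }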

Dataparallel Heat Distribution: Red-black Ordering

    for (iter = 0; iter < limit; iter++)
    {
      forall (i = 1; i < n-1; i++)
        forall (j = 1; j < n-1; j++)
          if ((i+j) % 2 != 0)    // red sweep
            h[i][j] = 0.25*(h[i-1][j]+h[i+1][j]+h[i][j-1]+h[i][j+1]);
      forall (i = 1; i < n-1; i++)
        forall (j = 1; j < n-1; j++)
          if ((i+j) % 2 == 0)    // black sweep
            h[i][j] = 0.25*(h[i-1][j]+h[i+1][j]+h[i][j-1]+h[i][j+1]);
    }

Dataparallel = synchronous. Each sweep updates one color in place while reading only points of the other color, so no copy array is needed.

OpenMP Heat Distribution: Red-black Ordering

    #pragma omp parallel shared(h,limit,n) private(iter,i,j)
    for (iter = 0; iter < limit; iter++)
    {
      #pragma omp for
      for (i = 1; i < n-1; i++)
        for (j = 1; j < n-1; j++)
          if ((i+j) % 2 != 0)
            h[i][j] = 0.25*(h[i-1][j]+h[i+1][j]+h[i][j-1]+h[i][j+1]);
      #pragma omp for
      for (i = 1; i < n-1; i++)
        for (j = 1; j < n-1; j++)
          if ((i+j) % 2 == 0)
            h[i][j] = 0.25*(h[i-1][j]+h[i+1][j]+h[i][j-1]+h[i][j+1]);
    }

Iterative Solver

The Jacobi iteration for solving Ax = b updates each unknown from the previous iterate:

    x_i^(k+1) = ( b_i - sum_{j != i} a_ij * x_j^(k) ) / a_ii

(See Pacheco; Bertsekas and Tsitsiklis.)
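A one-step worked example (this 2x2 system is illustrative, not from the slides): take

    A = [ 4  1 ]    b = [ 1 ]
        [ 1  3 ]        [ 2 ]

and start from x^(0) = b = (1, 2), as the code below does. Then

    x_1^(1) = (b_1 - a_12 * x_2^(0)) / a_11 = (1 - 2)/4 = -0.25
    x_2^(1) = (b_2 - a_21 * x_1^(0)) / a_22 = (2 - 1)/3 ~ 0.333

Because A is diagonally dominant, the iterates converge to the exact solution x = (1/11, 7/11) ~ (0.091, 0.636).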

Iterative Solver: Jacobi Method

    for (i = 0; i < n; i++)
      x[i] = b[i];
    for (iter = 0; iter < limit; iter++)
    {
      for (i = 0; i < n; i++)
      {
        sum = -a[i][i] * x[i];   // cancels the j == i term added below
        for (j = 0; j < n; j++)
          sum = sum + a[i][j] * x[j];
        new_x[i] = (b[i] - sum) / a[i][i];
      }
      for (i = 0; i < n; i++)
        x[i] = new_x[i];
    }

Note: stopping criterion omitted.

Dataparallel Iterative Solver: Jacobi Method

    for (i = 0; i < n; i++)
      x[i] = b[i];
    for (iter = 0; iter < limit; iter++)
    {
      forall (i = 0; i < n; i++)
        sum[i] = -a[i][i] * x[i];
      for (j = 0; j < n; j++)
        forall (i = 0; i < n; i++)
          sum[i] = sum[i] + a[i][j] * x[j];
      forall (i = 0; i < n; i++)
        x[i] = (b[i] - sum[i]) / a[i][i];
    }

Dataparallel = synchronous (lock-step). Note: stopping criterion omitted.

Task Parallel Iterative Solver: Jacobi Method

    for each processor 0 <= p < n
      private iter, sum, j
      x[p] = b[p];
      for (iter = 0; iter < limit; iter++)
      {
        sum = -a[p][p] * x[p];
        for (j = 0; j < n; j++)
          sum = sum + a[p][j] * x[j];
        barrier
        x[p] = (b[p] - sum) / a[p][p];
      }

(Strictly, a second barrier after the update of x[p] is needed before the next iteration reads x again.)

Iterative Solver: Jacobi Method in OpenMP

    #pragma omp parallel shared(a,b,x,new_x,n) private(iter,i,j,sum)
    {
      #pragma omp for
      for (i = 0; i < n; i++)
        x[i] = b[i];
      for (iter = 0; iter < limit; iter++)
      {
        #pragma omp for
        for (i = 0; i < n; i++)
        {
          sum = -a[i][i] * x[i];
          for (j = 0; j < n; j++)
            sum = sum + a[i][j] * x[j];
          new_x[i] = (b[i] - sum) / a[i][i];
        }
        #pragma omp for
        for (i = 0; i < n; i++)
          x[i] = new_x[i];
      }
    }

OpenMP Iterative Solver: Checking for Convergence

    #pragma omp parallel shared(a,b,x,new_x,n,notdone)
    for (iter = 0; iter < limit; iter++)
    {
      #pragma omp for reduction(||:notdone) private(sum,i,j)
      for (i = 0; i < n; i++)
      {
        sum = 0;
        for (j = 0; j < n; j++)
          sum = sum + a[i][j] * x[j];
        if (fabs(sum - b[i]) >= tolerance)
          notdone = 1;
      }
      if (notdone == 0)
        break;
    }

(Bertsekas and Tsitsiklis.) Note that notdone must also be reset to 0 at the start of each iteration; see the sketch below.
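A compilable sketch that folds the update and the convergence test together. The test system, the change-in-x stopping criterion, the per-iteration reset of notdone (which the slide omits), and the barrier discipline around the test are my additions, under the assumption that reduction(||:notdone) is the intended clause:

    #include <stdio.h>
    #include <math.h>

    #define N 4
    #define LIMIT 1000

    int main(void) {
        /* assumed diagonally dominant test system; exact solution: all x[i] = 1 */
        double a[N][N] = {{10,1,1,1},{1,10,1,1},{1,1,10,1},{1,1,1,10}};
        double b[N] = {13, 13, 13, 13};
        double x[N], new_x[N];
        double tolerance = 1e-10;
        int notdone = 1;

        for (int i = 0; i < N; i++) x[i] = b[i];

        #pragma omp parallel shared(a, b, x, new_x, notdone)
        for (int iter = 0; iter < LIMIT; iter++) {
            #pragma omp single
            notdone = 0;                 /* reset; implicit barrier follows the single */

            #pragma omp for reduction(||:notdone)
            for (int i = 0; i < N; i++) {
                double sum = -a[i][i] * x[i];
                for (int j = 0; j < N; j++)
                    sum += a[i][j] * x[j];
                new_x[i] = (b[i] - sum) / a[i][i];
                notdone = notdone || (fabs(new_x[i] - x[i]) >= tolerance);
            }

            #pragma omp for
            for (int i = 0; i < N; i++)
                x[i] = new_x[i];
            /* implicit barrier: all threads now see the reduced notdone */

            if (!notdone) break;         /* every thread takes the same branch */
            #pragma omp barrier          /* keep a fast thread from resetting notdone
                                            before a slow one has tested it */
        }

        printf("x = %f %f %f %f\n", x[0], x[1], x[2], x[3]);
        return 0;
    }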

OpenMP Iterative Solver: Gauss-Seidel Relaxation

Departure from the pure dataparallel model!

    #pragma omp parallel shared(a,b,x,n,nt) private(iter,i,j,sum,tid,work,lo,hi,loc_x)
    {
      #pragma omp single
      nt = omp_get_num_threads();

      tid = omp_get_thread_num();
      work = (n + nt-1) / nt;
      lo = work * tid;
      hi = lo + work;
      if (hi > n) hi = n;

      for (i = lo; i < hi; i++)
        x[i] = b[i];
      #pragma omp flush(x)  // need this?

      for (iter = 0; iter < limit; iter++)
      {
        #pragma omp barrier
        for (i = lo; i < hi; i++)
        {
          sum = -a[i][i] * x[i];
          for (j = 0; j < n; j++)
            if (j >= lo && j < i)
              sum = sum + a[i][j] * loc_x[j-lo];  // values already updated this sweep
            else
              sum = sum + a[i][j] * x[j];
          loc_x[i-lo] = (b[i] - sum) / a[i][i];
        }
        #pragma omp barrier
        for (i = lo; i < hi; i++)
          x[i] = loc_x[i-lo];
        #pragma omp flush(x)  // need this?
      }
    }

Synchronous Computing with High-Performance Fortran

High Performance Fortran (HPF) is an extension of Fortran 90 with constructs for parallel computing:

- Dataparallel FORALL
- PURE (side-effect free) functions
- Directives for recommended data distributions over processors
- Library routines for parallel sum, prefix (scan), scattering, sorting, ...

HPF uses the array syntax of Fortran 90 as a dataparallel model of computation: it spreads the work of a single array computation over multiple processors. This allows efficient implementation on both SIMD and MIMD style architectures, shared memory and DSM. But most users and vendors prefer OpenMP over HPF.

HPF

    !HPF$ PROCESSORS procname(dim1,...,dimn)
    !HPF$ DISTRIBUTE array1(dist),...,arraym(dist) ONTO procname

Example:

    !HPF$ PROCESSORS pr(2,2)
          REAL X(8,8)
    !HPF$ DISTRIBUTE X(BLOCK,*) ONTO pr

Shaded area: P0 has X(1:2,1:8). (Image from DOE ANL)

HPF

    !HPF$ ALIGN array WITH target

Aligns an array with a target array so that it conforms to the target's data distribution over the processors.

Example:

          REAL A(6),B(8),C(4)
          REAL D(6,6),E(6,6)
    !HPF$ ALIGN A(I) WITH B(I)

(Image from DOE ANL)

HPF Heat Distribution Problem

    !HPF$ PROCESSORS pr(4)
          REAL h(100,100)
    !HPF$ DISTRIBUTE h(BLOCK,*) ONTO pr

          h(2:99,2:99) = 0.25*( h(1:98,2:99) + h(3:100,2:99) &
                              + h(2:99,1:98) + h(2:99,3:100))

Alternatively, with FORALL:

          FORALL (i=2:99, j=2:99) &
            h(i,j) = 0.25*( h(i-1,j) + h(i+1,j) + h(i,j-1) + h(i,j+1))

HPF Heat Distribution Problem: Red-black Ordering

    !HPF$ PROCESSORS pr(4)
          REAL h(100,100)
    !HPF$ DISTRIBUTE h(BLOCK,*) ONTO pr

          FORALL (i=2:99, j=2:99, MOD(i+j,2) .EQ. 0) &
            h(i,j) = 0.25*(h(i-1,j) + h(i+1,j) + h(i,j-1) + h(i,j+1))
          FORALL (i=2:99, j=2:99, MOD(i+j,2) .EQ. 1) &
            h(i,j) = 0.25*(h(i-1,j) + h(i+1,j) + h(i,j-1) + h(i,j+1))

HPF Iterative Solver: Jacobi Method

    !HPF$ PROCESSORS pr(4)
          REAL a(100,100), b(100), x(100), s(100)
    !HPF$ ALIGN x(:) WITH a(*,:)
    !HPF$ ALIGN b(:) WITH x(:)
    !HPF$ ALIGN s(:) WITH x(:)
    !HPF$ DISTRIBUTE a(BLOCK,*) ONTO pr

          x = b
          FORALL (i = 1:n) s(i) = SUM(a(i,:)*x(:))
          DO iter = 0,limit
            FORALL (i = 1:n) x(i) = (b(i) - s(i) + a(i,i)*x(i)) / a(i,i)
            FORALL (i = 1:n) s(i) = SUM(a(i,:)*x(:))
            IF (MAXVAL(ABS(s - b)) < tolerance) EXIT
          ENDDO

(The declaration of s, which the slide omits, is added here so the ALIGN directive has a target.)

Gaussian Elimination

The original system of equations Ax = b is reduced to an upper triangular form Ux = y, where U is an N x N matrix in which all elements below the diagonal are zero and the diagonal elements have the value 1. Back substitution then solves the new system of equations to obtain the values of x.

(Image from DOE ANL.) See: http://www-unix.mcs.anl.gov/dbpp/text/node82.html

HPF Gaussian Elimination (1)

          REAL A(n,n+1), X(n), Fac(n), Row(n+1)
          INTEGER indx(n), itmp(1), max_indx, i, j, k
    !HPF$ ALIGN Row(j) WITH A(1,j)
    !HPF$ ALIGN X(i) WITH A(i,n+1)
    !HPF$ DISTRIBUTE A(*,CYCLIC)

          indx = 0
          DO i = 1,n
            itmp = MAXLOC(ABS(A(:,i)), MASK = indx .EQ. 0)   ! Stage 1
            max_indx = itmp(1)                               ! Stage 2
            indx(max_indx) = i
            Fac = A(:,i) / A(max_indx,i)                     ! Stage 3+4
            Row = A(max_indx,:)
            FORALL (j=1:n, k=i:n+1, indx(j) .EQ. 0) &        ! Stage 5
              A(j,k) = A(j,k) - Fac(j)*Row(k)
          ENDDO

          ! Row exchange
          FORALL (j=1:n) A(indx(j),:) = A(j,:)

          ! Back substitution, uses b(:) stored in A(1:n,n+1)
          DO j = n,1,-1
            X(j) = A(j,n+1) / A(j,j)
            A(1:j-1,n+1) = A(1:j-1,n+1) - A(1:j-1,j)*X(j)
          ENDDO

HPF Gaussian Elimination 2 Computing the upper triangular form takes five stages: 1. Reduction with MAXLOC 2. Broadcast (copy) max_indx 3. Compute scale factors Fac 4. Broadcast scale factor Fac and pivot row value Row(k) 5. Row update with FORALL Image from DOE ANL HPC Fall 2010 27