GPU Programming Paradigms

1 GPU Programming with PGI CUDA Fortran and the PGI Accelerator Programming Model
  Boris Bierbaum, Sandra Wienke

2 Current:
    linuxc7: CentOS 5.3, Nvidia GeForce GT 220
    hpc-denver: Windows 7, Nvidia GeForce GT 220
    hpc-orlando: Windows 7, ATI Radeon 4650
    Intel GHz, 8 GiB RAM
  Upcoming:
    Tesla S1070 1U Rack Box: 4 Tesla T10 GPUs, 16 GiB RAM
    Tesla S20x0 1U Rack Box: 4 Fermi GPUs, 12/24 GiB RAM
    Connected to Nehalem servers: 2 x Intel X5570 @ 2.93 GHz
  Future:
    Visualization cluster powering the new CAVE, Fermi graphics
    Available for HPC in batch mode at night/weekends

3 CUBLAS & CUFFT
  BLAS (Basic Linear Algebra Subprograms) and FFT (Fast Fourier Transform): popular interfaces / algorithms for numerical computation
  CUBLAS / CUFFT offload the computation onto the GPU, but ship as headers + libraries and are used like any other library
  Potential to offload computation and harness GPU power without having to restructure your program and painfully re-code your algorithms
  Nvidia documentation available for both libraries
  If you make heavy use of BLAS or FFT => try it out!

4 PGI: Overview
  Portland Group:
    Compilers: pgcc, pgfortran
    GPU activities:
      CUDA Fortran
      PGI Accelerator Programming Model
  Works only on Nvidia GPUs
  Commercial compiler product (license needed)
  Available on our Linux cluster in several versions, including the most recent (10.3)

5 PGI CUDA FORTRAN

6 Overview
  CUDA is an architecture and a programming model, accessible to C programmers via C for CUDA, a C extension
  Likewise, CUDA Fortran makes the CUDA model accessible to Fortran programmers
    Same level of abstraction
  Compiler support necessary: CUDA Fortran is an extension to Fortran
  Developed by The Portland Group and only available with the PGI compilers

7 Example: Overall Structure
    module calcpi_mod
      use cudafor
    contains
      real attributes(device) function f(a)
        [...]
      end function f
      attributes(global) subroutine partialsum(n, nbrthreads, res)
        [...]
      end subroutine partialsum
      real function calcpi(n, blockspergrid, threadsperblock)
        [...]
      end function calcpi
    end module calcpi_mod
    program Pi
      use calcpi_mod
      [...]

8 Example: Kernel code
    attributes(global) subroutine partialsum(n, nbrthreads, res)
      implicit none
      integer, intent(in), value :: n, nbrthreads
      real, intent(out), device :: res(nbrthreads)
      real :: fh, fsum, fx
      integer :: i, idx
      fh = 1.0 / real(n)
      fsum = 0.0
      ! 1-based global thread index
      idx = blockdim%x * (blockidx%x - 1) + threadidx%x
      ! Each thread sums every nbrthreads-th interval
      do i = idx - 1, n - 1, nbrthreads
        fx = fh * (real(i) + 0.5)
        fsum = fsum + f(fx)
      end do
      res(idx) = fh * fsum
    end subroutine partialsum
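The index arithmetic of the kernel can be checked on the host without a GPU. The following is a minimal sketch in plain C that runs the "threads" one after another. Two things are assumptions, since the slides elide them: the integrand f(x) = 4 / (1 + x^2), and the final step in which the host adds up the entries of res. The name calc_pi_emulated is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed integrand: the slides do not show the body of f. */
static double f(double x) { return 4.0 / (1.0 + x * x); }

/* Work of one emulated thread. idx is the 1-based global thread index,
   i.e. blockdim%x * (blockidx%x - 1) + threadidx%x in the kernel. */
static double partial_sum(int n, int nbrthreads, int idx)
{
    double fh = 1.0 / (double)n;
    double fsum = 0.0;
    /* Same bounds and stride as "do i = idx - 1, n - 1, nbrthreads". */
    for (int i = idx - 1; i <= n - 1; i += nbrthreads) {
        double fx = fh * ((double)i + 0.5);
        fsum += f(fx);
    }
    return fh * fsum;
}

/* Sequential emulation of the launch, followed by an assumed host-side
   reduction over the per-thread results. */
double calc_pi_emulated(int n, int blockspergrid, int threadsperblock)
{
    assert(n > 0 && blockspergrid > 0 && threadsperblock > 0);
    int nbrthreads = blockspergrid * threadsperblock;
    double *res = malloc((size_t)nbrthreads * sizeof *res);
    assert(res != NULL);

    for (int blockidx = 1; blockidx <= blockspergrid; ++blockidx) {
        for (int threadidx = 1; threadidx <= threadsperblock; ++threadidx) {
            int idx = threadsperblock * (blockidx - 1) + threadidx;
            res[idx - 1] = partial_sum(n, nbrthreads, idx);
        }
    }

    double pi = 0.0;
    for (int k = 0; k < nbrthreads; ++k)
        pi += res[k];
    free(res);
    return pi;
}
```

Calling calc_pi_emulated(1000000, 4, 64) emulates 4 blocks of 64 threads. Because the loop stride equals the total thread count, every interval is visited by exactly one thread.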

9 Example: Kernel call
    [...]
    real, allocatable, dimension(:), device :: result_dev
    [...]
    allocate( result_dev(nbrthreads), stat = status )
    [...]
    call partialsum<<<blockspergrid, threadsperblock>>>(n, nbrthreads, result_dev)
    r = cudathreadsynchronize()
    ierr = cudagetlasterror()
    if (ierr /= 0) then
      write (*,"(A, ' ', A)") 'CUDA Error: ', cudageterrorstring(ierr)
      stop
    end if
    [...]
    deallocate( result_dev )

10 Subprogram Qualifiers
  Host subprogram, attributes(host): function or subroutine; can only be called from another host subprogram; default attribute
  Device subprogram (kernel subroutine), attributes(global): subroutine only; may only be called from a host subprogram using the chevron call syntax
  Device subprogram, attributes(device): function or subroutine; only callable from a device subprogram
  Certain restrictions apply to device subprograms:
    Device subprograms must not be recursive
    No assumed-shape arrays as dummy arguments

11 Variable Qualifiers
  Device global memory (device): use modules to share between host and device subprograms
  Constant memory space (constant): use modules to share between host and device subprograms; may be modified only in host subprograms; may not be allocatable
  Device shared memory (shared): only in device subprograms; shared by all threads in a block
  Page-locked memory on host (pinned): must be an allocatable array

12 Fortran Specifics
  Attributed variable declarations: device, constant, shared, pinned; use allocate for dynamic memory allocation
  (Implicit) data transfer:
    A = Adev  or  Adev = A  or  Adev = Bdev
    B = A + Adev
    C = A * Adev + B
    Computation done on host
  Fortran intrinsics in device subprograms: a limited number of standard intrinsics available
  New intrinsics: syncthreads, gpu_time, ...

13 Build and Run CUDA Fortran Code
  Use the PGI compiler: module switch intel pgi
  Filename suffix: .cuf or .CUF
  Explicitly enable CUDA Fortran: -Mcuda
  Emulation mode: -Mcuda=emulate

14 PGI ACCELERATOR PROGRAMMING MODEL

15 Overview
  Usable for C and Fortran
  Directives like OpenMP
    C:       #pragma acc <directive-name> [<clause>]
    Fortran: !$acc <directive-name> [<clause>]
  OpenMP:
    !$omp parallel do private(tmp)
    do i = 1, n
      tmp = 2.0 * x(i)
      y(i) = tmp * tmp
    end do
  PGI Accelerator:
    !$acc region do
    do i = 1, n
      tmp = 2.0 * x(i)
      y(i) = tmp * tmp
    end do
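The slide shows only the Fortran form of the directive. A C counterpart of the same loop might look like the sketch below. It assumes the region is written as a braced block after #pragma acc region, following the pattern given on the slide; the function name scale_and_square is hypothetical. A compiler without PGI Accelerator support simply ignores the pragma and runs the loop on the host, and a real PGI build would likely also need data clauses that state the array bounds for the pointer arguments.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y[i] = (2 * x[i])^2 for i in [0, n).
   The pragma marks the block as an accelerator compute region. */
void scale_and_square(int n, const float *restrict x, float *restrict y)
{
    assert(n >= 0 && x != NULL && y != NULL);
#pragma acc region
    {
        for (int i = 0; i < n; ++i) {
            float tmp = 2.0f * x[i];   /* tmp is private to each iteration */
            y[i] = tmp * tmp;
        }
    }
}
```

Declaring tmp inside the loop body plays the role of private(tmp) in the OpenMP version: each iteration gets its own copy.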

16 Getting Started
  Test connection to GPU:
    pgaccelinfo
  Compilation:
    pgfortran -ta=nvidia -Minfo=accel a.f90
    -ta=nvidia: build for (Nvidia) GPU
    -Minfo=accel: enable compiler feedback
  Compilation generates host x86 code & GPU/accelerator code
  Print a message when a kernel is executed:
    export ACC_NOTIFY=1

17 Excerpt of Jacobi Example
    do while (iitercount < iitermax .and. residual > ftolerance)
      residual = 0.0d0
      ! Copy new solution into old
      uold = afu
      ! Compute stencil, residual, & update
      do j = 1, irows - 2
        do i = 1, icols - 2
          ! Evaluate residual
          flres = (ax * (uold(i-1, j) + uold(i+1, j)) &
                 + ay * (uold(i, j-1) + uold(i, j+1)) &
                 + b * uold(i, j) - aff(i, j)) / b
          ! Update solution
          afu(i, j) = uold(i, j) - frelax * flres
          ! Accumulate residual error
          residual = residual + flres * flres
        end do
      end do
      iitercount = iitercount + 1
      residual = SQRT(residual) / REAL(iCols * irows)
    end do
  (linuxc7, Intel Core2Quad Q9400, PGI 10.2, Matrix 5000x5000) ~ 1460 MFlops, single precision!

18 Jacobi & PGI Acc: 1st try
    do while (iitercount < iitermax .and. residual > ftolerance)
      residual = 0.0d0
      !$acc region
      ! Copy new solution into old
      uold = afu
      ! Compute stencil, residual, & update
      !$acc do
      do j = 1, irows - 2
        do i = 1, icols - 2
          ! Evaluate residual
          flres = (ax * (uold(i-1, j) + uold(i+1, j)) &
                 + ay * (uold(i, j-1) + uold(i, j+1)) &
                 + b * uold(i, j) - aff(i, j)) / b
          ! Update solution
          afu(i, j) = uold(i, j) - frelax * flres
          ! Accumulate residual error
          residual = residual + flres * flres
        end do
      end do
      !$acc end region
      iitercount = iitercount + 1
      residual = SQRT(residual) / REAL(iCols * irows)
    end do
  !$acc region ... !$acc end region: Compute Region Directive
  !$acc do: Loop Mapping Directive
  (linuxc7, Nvidia GeForce GT220, PGI 10.2, Matrix 5000x5000) ~ 1069 MFlops

19 Jacobi & PGI Acc: Compiler Feedback
    59, Loop not vectorized/parallelized: multiple exits
    68, Generating copyin(aff(1:icols-2,1:irows-2))
        Generating copyin(afu(0:icols-1,0:irows-1))
        Generating copyout(afu(1:icols-2,1:irows-2))
        Generating copyout(uold(0:icols-1,0:irows-1))
    69, Loop is parallelizable
        Accelerator kernel generated
        69, !$acc do parallel, vector(16)
    73, Loop is parallelizable
    74, Loop is parallelizable
        Accelerator kernel generated
        73, !$acc do parallel, vector(16)
        74, !$acc do parallel, vector(16)
        Cached references to size [18x18] block of 'uold'
        84, Sum reduction generated for residual

20 Jacobi & PGI Acc: 2nd try
    do while (iitercount < iitermax .and. residual > ftolerance)
      residual = 0.0d0
      !$acc region local(uold), copy(afu), copyin(aff)
      ! Copy new solution into old
      uold = afu
      ! Compute stencil, residual, & update
      !$acc do
      do j = 1, irows - 2
        do i = 1, icols - 2
          ! Evaluate residual
          flres = (ax * (uold(i-1, j) + uold(i+1, j)) &
                 + ay * (uold(i, j-1) + uold(i, j+1)) &
                 + b * uold(i, j) - aff(i, j)) / b
          ! Update solution
          afu(i, j) = uold(i, j) - frelax * flres
          ! Accumulate residual error
          residual = residual + flres * flres
        end do
      end do
      !$acc end region
      iitercount = iitercount + 1
      residual = SQRT(residual) / REAL(iCols * irows)
    end do
  local(uold), copy(afu), copyin(aff): Data Copy clauses (also: copyout, ...)
  ~ 1230 MFlops

21 Jacobi & PGI Acc: Compiler Feedback
    59, Loop not vectorized/parallelized: multiple exits
    68, Generating copyin(aff(:,:))
        Generating copy(afu(:,:))
        Generating local(uold(:,:))
    69, Loop is parallelizable
        Accelerator kernel generated
        69, !$acc do parallel, vector(16)
    73, Loop is parallelizable
    74, Loop is parallelizable
        Accelerator kernel generated
        73, !$acc do parallel, vector(16)
        74, !$acc do parallel, vector(16)
        Cached references to size [18x18] block of 'uold'
        84, Sum reduction generated for residual

22 Jacobi & PGI Acc: 3rd try
    do while (iitercount < iitermax .and. residual > ftolerance)
      residual = 0.0d0
      !$acc region local(uold), copy(afu), copyin(aff)
      ! Copy new solution into old
      uold = afu
      ! Compute stencil, residual, & update
      !$acc do parallel
      do j = 1, irows - 2
        !$acc do vector(256)
        do i = 1, icols - 2
          ! Evaluate residual
          flres = (ax * (uold(i-1, j) + uold(i+1, j)) &
                 + ay * (uold(i, j-1) + uold(i, j+1)) &
                 + b * uold(i, j) - aff(i, j)) / b
          ! Update solution
          afu(i, j) = uold(i, j) - frelax * flres
          ! Accumulate residual error
          residual = residual + flres * flres
        end do
      end do
      !$acc end region
      iitercount = iitercount + 1
      residual = SQRT(residual) / REAL(iCols * irows)
    end do
  !$acc do parallel: Loop Scheduling, parallel clause (doall parallelism)
  !$acc do vector(256): Loop Scheduling, vector clause (synchronous parallelism)
  ~ 1570 MFlops

23 Jacobi & PGI Acc: Compiler Feedback
    59, Loop not vectorized/parallelized: multiple exits
    68, Generating copyin(aff(:,:))
        Generating copy(afu(:,:))
        Generating local(uold(:,:))
    69, Loop is parallelizable
        Accelerator kernel generated
        69, !$acc do parallel, vector(16)
    73, Loop is parallelizable
    75, Loop is parallelizable
        Accelerator kernel generated
        73, !$acc do parallel
        75, !$acc do vector(256)
        Cached references to size [258x3] block of 'uold'
        85, Sum reduction generated for residual
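One plausible reading of the "Cached references" messages, offered as an interpretation rather than something the slides state: the cached block of uold is the set of points one thread block updates, widened by the stencil radius on every side. The sketch below does that arithmetic; the type tile_t and the function cached_tile are hypothetical names.

```c
#include <assert.h>

/* Extent of a cached block in the i and j dimensions. */
typedef struct {
    int i_extent;
    int j_extent;
} tile_t;

/* Block of points updated by one thread block (vec_i x vec_j),
   grown by 'radius' halo points on each side for the stencil reads. */
tile_t cached_tile(int vec_i, int vec_j, int radius)
{
    assert(vec_i > 0 && vec_j > 0 && radius >= 0);
    tile_t t = { vec_i + 2 * radius, vec_j + 2 * radius };
    return t;
}
```

With the 5-point stencil (radius 1), a 16 x 16 schedule gives 18 x 18, and a schedule of 256 in i with a single row in j gives 258 x 3, which matches the sizes the compiler reports for the vector(16) and vector(256) variants.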

24 Jacobi & PGI Acc: CUDA Profiler

25 Jacobi & PGI Acc: 4th try
    !$acc data region local(uold), copy(afu), copyin(aff)
    do while (iitercount < iitermax .and. residual > ftolerance)
      residual = 0.0d0
      !$acc region
      ! Copy new solution into old
      uold = afu
      ! Compute stencil, residual, & update
      !$acc do parallel
      do j = 1, irows - 2
        !$acc do vector(256)
        do i = 1, icols - 2
          ! Evaluate residual
          flres = (ax * (uold(i-1, j) + uold(i+1, j)) &
                 + ay * (uold(i, j-1) + uold(i, j+1)) &
                 + b * uold(i, j) - aff(i, j)) / b
          ! Update solution
          afu(i, j) = uold(i, j) - frelax * flres
          ! Accumulate residual error
          residual = residual + flres * flres
        end do
      end do
      !$acc end region
      iitercount = iitercount + 1
      residual = SQRT(residual) / REAL(iCols * irows)
    end do
    !$acc end data region
  !$acc data region ... !$acc end data region: Data Region Directive
  ~ 3550 MFlops
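A back-of-the-envelope model suggests why hoisting the data clauses out of the iteration loop pays off. The sketch below counts only whole-array transfers (aff in, afu in, afu out) and ignores the scalar residual. The element size of 4 bytes and the iteration count in the usage note are assumptions for illustration, and both function names are hypothetical.

```c
#include <assert.h>

/* Data clauses on the compute region inside the loop:
   three n x n array transfers on every iteration. */
long long bytes_region_clauses(long long n, long long elem_size, long long iters)
{
    assert(n > 0 && elem_size > 0 && iters >= 0);
    return 3 * n * n * elem_size * iters;
}

/* Data clauses on an enclosing data region:
   the same three transfers, performed once. */
long long bytes_data_region(long long n, long long elem_size)
{
    assert(n > 0 && elem_size > 0);
    return 3 * n * n * elem_size;
}
```

For a 5000 x 5000 single-precision matrix and 20 iterations, this model gives 6 GB moved with region-level clauses versus 300 MB with a data region, a factor equal to the iteration count.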

26 Jacobi & PGI Acc: Compiler Feedback
    59, Generating local(uold(:,:))
        Generating copyin(aff(:,:))
        Generating copy(afu(:,:))
    60, Loop not vectorized/parallelized: multiple exits
    70, Loop is parallelizable
        Accelerator kernel generated
        70, !$acc do parallel, vector(16)
    74, Loop is parallelizable
    76, Loop is parallelizable
        Accelerator kernel generated
        74, !$acc do parallel
        76, !$acc do vector(256)
        Cached references to size [258x3] block of 'uold'
        86, Sum reduction generated for residual

27 Jacobi & PGI Acc: CUDA Profiler

28 PGI 10.1 vs 10.2/10.3
    [...]
    !$acc region
    ! Copy new solution into old
    uold = afu
    [...]
  10.1:
    66, Memory copy idiom, array assignment replaced by call to pgf90_mcopy4
    => Copy operation performed on host CPU
    ~ 4180 MFlops
  10.2 / 10.3:
    Loop is parallelizable
    Accelerator kernel generated
    66, !$acc do parallel, vector(16)
    => Copy operation performed on GPU
    ~ 3550 MFlops
  Performance on Tesla PGI: 9651 Mflops (10.1) vs (10.2)

29 Best Version Yet
  Replace
    [...]
    !$acc region
    ! Copy new solution into old
    uold = afu
    [...]
  by
    [...]
    !$acc region
    ! Copy new solution into old
    !$acc do parallel
    do j = 0, irows - 1
      !$acc do vector(256)
      do i = 0, icols - 1
        uold(i,j) = afu(i,j)
      end do
    end do
    [...]
  10.1 ~ 6070 MFlops
  10.2 / 10.3 ~ 3800 MFlops

30 Summary for Jacobi Example
  Data movement between host and accelerator
    Use data copy clauses (local, copy, ...)
    Copy whole arrays
    Use a data region
  Parallelism on accelerator
    Try different loop scheduling (e.g. doall and synchronous parallelism: parallel, vector)
    Try to make width in vector(width) clauses a multiple of 32 to match the Nvidia CUDA warp size
  PGI Links
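The warp-size advice can be turned into a small helper that rounds a candidate vector width up to the next multiple of 32. This is only a sketch; the helper name round_up_to_warp is hypothetical and not part of any PGI interface.

```c
#include <assert.h>

/* Nvidia CUDA warp size, as stated in the summary. */
enum { WARP_SIZE = 32 };

/* Smallest multiple of WARP_SIZE that is >= width. */
int round_up_to_warp(int width)
{
    assert(width > 0);
    return ((width + WARP_SIZE - 1) / WARP_SIZE) * WARP_SIZE;
}
```

A candidate width of 250 becomes 256, the value used in the vector(256) schedule of the Jacobi example, while 256 itself is left unchanged.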

31 Performance Considerations
  Data movement between host and accelerator
    Minimize amount, number and frequency
    Maximize bandwidth
    Optimize data allocation in device memory
  Parallelism on accelerator
    Lots of MIMD parallelism to fill multiprocessors
    Lots of SIMD parallelism to fill cores on a multiprocessor
    Lots more MIMD parallelism
  Data movement between device memory and cores
    Minimize frequency
    Optimize strides: stride-1 in vector dimension
    Optimize alignment: 16-word aligned in vector dimension
    Store array blocks in data cache (CUDA shared memory)

32 BACKUP SLIDES

33 PGI Accelerator: Compiler Flags
  pgfortran -ta=nvidia,time -Minfo=accel a.f90
  Links in a timer library: collects and prints out simple timing information about the accelerator regions and generated kernels
    jacobi
      59: region entered 1 time
        time(us): total= init= region= kernels= data=
        w/o init: total= max= min= avg=
      : kernel launched 20 times
        grid: [313x313] block: [16x16]
        time(us): total= max=16391 min=15497 avg=
      : kernel launched 20 times
        grid: [4998] block: [256]
        time(us): total= max=67314 min=66964 avg=
      : kernel launched 20 times
        grid: [1] block: [256]
        time(us): total=587 max=36 min=27 avg=29
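The grid sizes in the timing output are consistent with a ceiling division of the loop trip count by the block width. The sketch below assumes that the [313x313] grid belongs to the 16 x 16 schedule over a 5000 x 5000 iteration space and that the [4998] grid assigns one block per row of the interior loop; both are inferences from the numbers, not statements from the slides. The helper name blocks_needed is hypothetical.

```c
#include <assert.h>

/* Number of blocks needed to cover 'iterations' loop iterations
   when each block handles 'block_width' of them (ceiling division). */
int blocks_needed(int iterations, int block_width)
{
    assert(iterations > 0 && block_width > 0);
    return (iterations + block_width - 1) / block_width;
}
```

For 5000 iterations and a width of 16 this gives 313, and for 4998 iterations with one row per block it gives 4998.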

34 PGI Accelerator: Runtime library routines
  Fortran: module accel_lib
  C: accel.h
  acc_get_num_devices(devicetype)
  acc_set_device(devicetype)
  acc_init(devicetype): initialise runtime for device, e.g. for isolating initialisation cost from computational cost
  acc_shutdown(devicetype)

35 PGI Accelerator: Compiler Flags
  pgfortran -ta=nvidia,cc11 -Minfo=accel a.f90
    Generates code for compute capability 1.1
    Compute capability depends on the graphics card
  pgfortran -ta=nvidia,keepgpu -Minfo=accel a.f90
    Keeps the kernel source files

36 More Information
  PGI User's Guide
  CUDA Fortran:
  PGI Accelerator:
  PGI User Forum:

37 THANK YOU FOR YOUR ATTENTION!
