Porting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics using the Immersed Boundary Method

1 Porting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics using the Immersed Boundary Method
Josh Romero, Massimiliano Fatica - NVIDIA
Vamsi Spandan, Roberto Verzicco - Physics of Fluids, University of Twente
HPC Advisory Council Workshop, Stanford, CA, February 2018

2 Outline
- Introduction and Motivation
- Solver Details
- GPU implementation in CUDA Fortran
- Benchmarking and Results
- Conclusions

3 Introduction and Motivation
- Increased availability of GPU compute resources:
  - Explosion of interest in Machine Learning
  - Focus on energy efficiency for exascale
- Lots of choices to make:
  - OpenACC vs. CUDA
  - CUDA C vs. CUDA Fortran
- Getting existing Fortran codes up and running on GPUs can be easy if you use the right tools
- Talk is focused on getting up and running with low effort.

4 Solver Details

5 Solver Details
- Incompressible CFD solver for DNS computations in structured domains
- IB + structural solver using method described in [1]
  - Immersed interface contributes forcing term to fluid
  - Interface structural dynamics treated as triangulated network of springs
[1] Spandan et al., Journal of Computational Physics, 2017

6 Solver Details
[Flowchart of solver steps: Initialize Solver; Timestep Loop; RK Loop; Compute RK step; Compute IB forcing term; Structural update]

7 GPU Implementation in CUDA Fortran

8 CUDA Fortran
- Baseline CPU code is written in Fortran, so the natural choice for the GPU port is CUDA Fortran.
- Benefits:
  - More control than OpenACC:
    - Explicit GPU kernels written natively in Fortran are supported
    - Full control of host/device data movement
  - Directive-based programming available via CUF kernels
  - Easier to maintain than mixed CUDA C and Fortran approaches
- Requires PGI compiler (community edition available now for free)

9 Profiling with NVPROF + NVVP + NVTX
- NVPROF:
  - Can be used to gather detailed kernel properties and timing information
- NVIDIA Visual Profiler (NVVP):
  - Graphical interface to visualize and analyze NVPROF generated profiles
  - Does not show CPU activity out of the box
- NVIDIA Tools EXtension (NVTX) markers:
  - Enables annotation with labeled ranges within program
  - Useful for categorizing parts of profile to put activity into context
  - Can be used to visualize normally hidden CPU activity (e.g. MPI communication)
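One way to place NVTX ranges from Fortran is to bind directly to the C entry points nvtxRangePushA and nvtxRangePop. The following is a minimal sketch, not the authors' code. It assumes the program is linked against the NVTX library (for example with -lnvToolsExt). The module name nvtx_sketch and the wrappers range_push and range_pop are hypothetical.

```fortran
module nvtx_sketch
  use iso_c_binding, only: c_char, c_int, c_null_char
  implicit none
  private
  public :: range_push, range_pop

  interface
    ! C prototype: int nvtxRangePushA(const char* message)
    function nvtxRangePushA(message) bind(C, name='nvtxRangePushA') result(level)
      import :: c_char, c_int
      character(kind=c_char), intent(in) :: message(*)
      integer(c_int) :: level
    end function nvtxRangePushA
    ! C prototype: int nvtxRangePop(void)
    function nvtxRangePop() bind(C, name='nvtxRangePop') result(level)
      import :: c_int
      integer(c_int) :: level
    end function nvtxRangePop
  end interface

contains

  ! Open a labeled range; the label must be null-terminated for C.
  subroutine range_push(label)
    character(len=*), intent(in) :: label
    integer(c_int) :: level
    level = nvtxRangePushA(trim(label)//c_null_char)
  end subroutine range_push

  ! Close the most recently opened range.
  subroutine range_pop()
    integer(c_int) :: level
    level = nvtxRangePop()
  end subroutine range_pop

end module nvtx_sketch

program nvtx_demo
  use nvtx_sketch
  implicit none
  integer :: i
  real(8) :: s

  call range_push('host_work')   ! appears as a labeled bar in the profiler timeline
  s = 0.d0
  do i = 1, 1000000
    s = s + 1.d0/dble(i)
  end do
  call range_pop()
  print *, 'partial harmonic sum = ', s
end program nvtx_demo
```

Wrapping host-only phases such as MPI exchanges in a push/pop pair is what makes them visible next to the GPU kernels in the timeline.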

10 NVIDIA Visual Profiler with NVTX Markers

11 GPU Porting of Key Computational Routines
- In many CFD (and similar) codes, common code patterns appear:
  - Tightly-nested loop computations (computation of derivatives using stencils)
  - Common mathematical computations (Fourier transforms, matrix-algebra)
- But there are also unique patterns specific to a given application:
  - Computation of IB forcing on flow field
  - Computation of interface structural forces

12 Case 1: Tightly-nested loops
Consider the original CPU subroutine to compute the divergence.

    subroutine divg
      use param
      use local_arrays, only: q1, q2, q3, &
                              dph, jpv, ipv, &
                              udx3m
      ...
      do kc = kstart,kend
        do jc = 1,n2m
          do ic = 1,n1m
            kp = kc+1; jp = jpv(jc); ip = ipv(ic)
            dqcap = (q1(ip,jc,kc) - q1(ic,jc,kc)) * dx1 &
                   +(q2(ic,jp,kc) - q2(ic,jc,kc)) * dx2 &
                   +(q3(ic,jc,kp) - q3(ic,jc,kc)) * udx3m(kc)
            dph(ic,jc,kc) = dqcap*usdtal
          end do
        end do
      end do
    end subroutine divg

13 Case 1: Tightly-nested loops
Now, consider the version for GPU using CUF kernel directives.

    subroutine divg
      use param
      use local_arrays, only: q1=>q1_d, q2=>q2_d, q3=>q3_d, &
                              dph=>dph_d, jpv=>jpv_d, ipv=>ipv_d, &
                              udx3m=>udx3m_d
      ...
      !$cuf kernel do(3)
      do kc = kstart,kend
        do jc = 1,n2m
          do ic = 1,n1m
            kp = kc+1; jp = jpv(jc); ip = ipv(ic)
            dqcap = (q1(ip,jc,kc) - q1(ic,jc,kc)) * dx1 &
                   +(q2(ic,jp,kc) - q2(ic,jc,kc)) * dx2 &
                   +(q3(ic,jc,kp) - q3(ic,jc,kc)) * udx3m(kc)
            dph(ic,jc,kc) = dqcap*usdtal
          end do
        end do
      end do
    end subroutine divg

14 Case 1: Tightly-nested loops
- CUF kernel directive automatically generates GPU kernels for tightly nested loops.
- Scalar data passed by value to device.
- Array data must already be resident on device.

15 Case 1: Tightly-nested loops
For getting data onto the device, CUDA Fortran allows for straightforward declaration/allocation of device data.

    module local_arrays
      real(8), allocatable :: q1(:,:,:)
      real(8), device, allocatable :: q1_d(:,:,:)
      ...
    end module local_arrays

    allocate(q1(nx,ny,nz)); q1 = 0.d0
    allocate(q1_d(nx,ny,nz)); q1_d = q1

Alternative using sourced allocation:

    allocate(q1_d, source=q1)
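The pieces from the last few slides (device declaration, sourced allocation, a CUF kernel over a triple loop, and copy-back by assignment) can be put together in one small program. This is a minimal sketch under stated assumptions, not the solver's divergence routine. The array names q, dq and the periodic difference in the first index are made up for illustration, and the explicit <<<*,*>>> launch configuration lets the compiler pick the grid and block sizes.

```fortran
program cuf_kernel_sketch
  use cudafor
  implicit none
  integer, parameter :: n = 32
  real(8), allocatable :: q(:,:,:), dq(:,:,:)
  real(8), device, allocatable :: q_d(:,:,:), dq_d(:,:,:)
  real(8) :: dx1
  integer :: ic, jc, kc, ip

  dx1 = dble(n)                      ! inverse grid spacing on a unit domain
  allocate(q(n,n,n), dq(n,n,n))
  do kc = 1, n
    do jc = 1, n
      do ic = 1, n
        q(ic,jc,kc) = dble(ic)       ! simple ramp in the first index
      end do
    end do
  end do

  allocate(q_d, source=q)            ! sourced allocation: allocate and copy host -> device
  allocate(dq_d(n,n,n))

  ! The compiler generates a kernel from the three tightly nested loops.
  ! Scalars (dx1, ip) are handled per thread; arrays must be device arrays.
  !$cuf kernel do(3) <<<*,*>>>
  do kc = 1, n
    do jc = 1, n
      do ic = 1, n
        ip = mod(ic, n) + 1          ! periodic neighbour in the first index
        dq_d(ic,jc,kc) = (q_d(ip,jc,kc) - q_d(ic,jc,kc)) * dx1
      end do
    end do
  end do

  dq = dq_d                          ! device -> host copy by assignment

  ! Interior difference is 1*dx1; the wrap-around point gives (1-n)*dx1.
  if (abs(dq(1,1,1) - dx1) > 1.d-12) error stop 'unexpected interior value'
  if (abs(dq(n,1,1) - dble(1-n)*dx1) > 1.d-12) error stop 'unexpected wrap value'
  print *, 'CUF kernel sketch OK'
end program cuf_kernel_sketch
```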

16 Additional CUF kernel features
- CUF kernels can be used to perform reductions of scalar device data.
- Final reduced result can be on the host or device.

    subroutine calculate_volume_gpu(Volume,nv,nf,xyz,vert_of_face)
      integer, dimension(3,nf), device, intent(in) :: vert_of_face
      real(8), dimension(nv,3), device, intent(in) :: xyz
      real(8), intent(out) :: Volume
      ...
      Volume = 0.d0
      !$cuf kernel do (1)
      do i = 1,nf
        v1 = vert_of_face(1,i)
        v2 = vert_of_face(2,i)
        v3 = vert_of_face(3,i)
        x1 = xyz(v1,1); x2 = xyz(v2,1); x3 = xyz(v3,1)
        y1 = xyz(v1,2); y2 = xyz(v2,2); y3 = xyz(v3,2)
        z1 = xyz(v1,3); z2 = xyz(v2,3); z3 = xyz(v3,3)
        Volume = Volume + (x1 * (y2*z3 - z2*y3) + &
                           x2 * (y3*z1 - z3*y1) + &
                           x3 * (y1*z2 - z1*y2))/6.d0
      end do
    end subroutine calculate_volume_gpu
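The same reduction pattern can be tried in isolation. The sketch below is an assumed toy case, not taken from the solver: it sums a device array into a host scalar, and the names x_d and total are illustrative.

```fortran
program cuf_reduction_sketch
  use cudafor
  implicit none
  integer, parameter :: n = 1000
  real(8), device :: x_d(n)
  real(8) :: total
  integer :: i

  x_d = 2.d0            ! scalar assignment fills the device array
  total = 0.d0          ! host scalar that receives the reduced result

  ! The compiler recognises "total = total + ..." as a sum reduction
  ! and generates the partial sums and final combine.
  !$cuf kernel do(1) <<<*,*>>>
  do i = 1, n
    total = total + x_d(i)
  end do

  if (abs(total - 2.d0*dble(n)) > 1.d-9) error stop 'reduction mismatch'
  print *, 'sum = ', total
end program cuf_reduction_sketch
```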

17 Case 2: Common Mathematical Computations
- Beyond loop-based computations, many codes use common math computations for which there are GPU libraries readily available:
  - FFT: CUFFT
  - BLAS: CUBLAS
  - Linear Algebra: CUSOLVER
- Use wisely:
  - Favor batched implementations when available, avoid many repeated calls of small operations

18 Case 2: Common Mathematical Computations
Consider the original CPU code for completing a real-to-complex FFT using the FFTW library.

    coefnorm = 1.d0/(dble(n1m) * dble(n2m))
    do k = kstart,kend
      do j = 1,n2m
        do i = 1,n1m
          xr(j,i) = dph(i,j,k)
        end do
      end do
      call dfftw_execute_dft_r2c(fwd_plan,xr,xa)
      do j = 1,n2m/2 + 1
        do i = 1,n1m
          dpho(i,j,k) = dreal(xa(j,i)) * coefnorm
          dpho(i,j+n2mh,k) = dimag(xa(j,i)) * coefnorm
        end do
      end do
    end do

19 Case 2: Common Mathematical Computations
Now consider the version for GPU using the CUFFT library.
- Modified to use batched 2D FFTs
- Final loop merged with later packing loop (kernel fusion)

    coefnorm = 1.d0/(dble(n1m) * dble(n2m))
    !$cuf kernel do (3)
    do k = kstart,kend
      do j = 1,n2m
        do i = 1,n1m
          xr_d(j,i,k) = dph_d(i,j,k)
        end do
      end do
    end do
    istat = cufftexecd2z(cufft_fwd_plan, xr_d, xa_d)
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    ! Scaling/rearrangement combined with subsequent loop
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

20 Case 2: Common Mathematical Computations

    integer :: cufft_fwd_plan
    integer :: rank(2), inembed(2), onembed(2)

    rank(1) = n1m; rank(2) = n2m
    inembed(1) = n1m; inembed(2) = n2m
    onembed(1) = n1m; onembed(2) = n2m/2 + 1
    istat = cufftplanmany(cufft_fwd_plan, 2, rank, inembed, 1, &
                          n1m*n2m, onembed, 1, n1m*(n2m/2 + 1), &
                          CUFFT_D2Z, kend-kstart+1)

21 Interfaces for BLAS routines
- PGI provides overloaded interfaces for BLAS routines.
- Calls with device-resident arrays are automatically passed to the CUBLAS library.

    use cudafor
    use cublas
    integer :: m, n, k
    real(8) :: alpha, beta
    real(8) :: a(m,k), b(k,n), c(m,n)
    real(8), device :: a_d(m,k), b_d(k,n), c_d(m,n)
    ...
    ! DGEMM using linked CPU library
    call DGEMM('N', 'N', m, n, k, alpha, a, m, b, k, &
               beta, c, m)
    ! DGEMM using CUBLAS
    call DGEMM('N', 'N', m, n, k, alpha, a_d, m, b_d, k, &
               beta, c_d, m)

22 Case 3: Unique computations
- The need for custom kernels arises in most programs:
  - Unique computations not amenable to a CUF kernel
  - Common mathematical operation, but no good GPU library implementation:
    - Tridiagonal LU factorization/solves with multiple RHS
  - Pattern of library usage that would be poor performing on GPU:
    - Data interpolation from flow grid to structural grid involves many small matrix and vector computations.

23 Example 1: Batched Tridiagonal Solver
- Flow solver requires tridiagonal LU factorization/solves with multiple RHS
- Wrote batched tridiagonal solver using Thomas algorithm
  - One GPU thread assigned per RHS
- To ensure coalesced access of RHS values by threads, data transposition required:
  rhs_d(1:N1*N2, 1:NRHS) -> rhs_t_d(1:NRHS, 1:N1*N2)
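The one-thread-per-RHS idea with a transposed right-hand-side array can be sketched as below. This is an illustrative version under stated assumptions, not the authors' solver. A single tridiagonal matrix is shared by all right-hand sides, its factorization (arrays dinv and cp, both hypothetical names) is precomputed on the host, and the test matrix is the simple (-1, 2, -1) stencil.

```fortran
module thomas_sketch
  use cudafor
  implicit none
contains
  ! One thread solves one right-hand side. rhs_t is stored as (nrhs, n), so at
  ! every row i neighbouring threads read neighbouring addresses (coalesced).
  attributes(global) subroutine thomas_solve(a, dinv, cp, rhs_t, n, nrhs)
    integer, value :: n, nrhs
    real(8), intent(in) :: a(n), dinv(n), cp(n)
    real(8), intent(inout) :: rhs_t(nrhs, n)
    integer :: i, j
    j = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (j > nrhs) return
    ! Forward sweep using the precomputed factors
    rhs_t(j,1) = rhs_t(j,1) * dinv(1)
    do i = 2, n
      rhs_t(j,i) = (rhs_t(j,i) - a(i) * rhs_t(j,i-1)) * dinv(i)
    end do
    ! Back substitution
    do i = n-1, 1, -1
      rhs_t(j,i) = rhs_t(j,i) - cp(i) * rhs_t(j,i+1)
    end do
  end subroutine thomas_solve
end module thomas_sketch

program thomas_demo
  use cudafor
  use thomas_sketch
  implicit none
  integer, parameter :: n = 64, nrhs = 1000
  real(8) :: a(n), b(n), c(n), dinv(n), cp(n)
  real(8), allocatable :: rhs_t(:,:)
  real(8), device, allocatable :: a_d(:), dinv_d(:), cp_d(:), rhs_t_d(:,:)
  integer :: i, j, istat
  real(8) :: err

  a = -1.d0; b = 2.d0; c = -1.d0     ! a(1) and c(n) are not used

  ! Host-side factorization shared by every right-hand side
  dinv(1) = 1.d0 / b(1)
  cp(1) = c(1) * dinv(1)
  do i = 2, n
    dinv(i) = 1.d0 / (b(i) - a(i) * cp(i-1))
    cp(i) = c(i) * dinv(i)
  end do

  ! RHS j is A*x with x = j everywhere: j at both ends, 0 in the interior
  allocate(rhs_t(nrhs, n))
  rhs_t = 0.d0
  do j = 1, nrhs
    rhs_t(j,1) = dble(j)
    rhs_t(j,n) = dble(j)
  end do

  allocate(a_d, source=a)
  allocate(dinv_d, source=dinv)
  allocate(cp_d, source=cp)
  allocate(rhs_t_d, source=rhs_t)

  call thomas_solve<<<(nrhs + 127)/128, 128>>>(a_d, dinv_d, cp_d, rhs_t_d, n, nrhs)
  istat = cudaDeviceSynchronize()
  rhs_t = rhs_t_d

  err = 0.d0
  do j = 1, nrhs
    err = max(err, maxval(abs(rhs_t(j,:) - dble(j))))
  end do
  if (err > 1.d-8) error stop 'tridiagonal solve mismatch'
  print *, 'max error = ', err
end program thomas_demo
```

With the untransposed layout rhs(1:n, 1:nrhs), consecutive threads would be n elements apart at each step, which is why the transposition matters.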

24 Example 2: Data Interpolation Between Grids
- This is the most time-consuming operation in the IBM portion of the solver.
- Goal is to compute interpolated value on structural grid from flow grid.

25 Example 2: Data Interpolation Between Grids
For a given triangle i:
- Form 27-point support domain around triangle centroid.
- Compute transfer function, using support point and centroid data.
- Final centroid result scattered back to support points or to triangle vertices.

27 Example 2: Data Interpolation Between Grids
- Computation of transfer function for each triangle requires:
  - 4 x 4 matrix inversion
  - Several small matrix-vector multiplies: [1 x 4][4 x 4] and [1 x 4][4 x 27]
- Final computation is an inner product of 27 values.

28 Example 2: Data Interpolation Between Grids
GPU strategy:
- Process each triangle using a warp (32 thread unit), map threads to support points
- Data is warp-local, so most matrix algebra can be completed efficiently using warp shuffle intrinsics.
- Scattering of final result completed using atomic adds.
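The warp-per-triangle mapping can be illustrated with the simplest of the operations listed earlier, the 27-value inner product. The sketch below is an assumption-laden toy, not the authors' kernel. It launches one 32-thread block per item, uses the CUDA Fortran __shfl_xor intrinsic for a butterfly reduction across the warp, and has the first lane accumulate the result with atomicadd. The names w, f and total are hypothetical, and double-precision atomicadd assumes a GPU of compute capability 6.0 or newer.

```fortran
module warp_sketch
  use cudafor
  implicit none
contains
  ! One warp per item: lanes 1..27 each hold one product, lanes 28..32 hold zero.
  ! Assumes a launch with exactly 32 threads per block and one block per item.
  attributes(global) subroutine warp_dot(w, f, total, nitems)
    integer, value :: nitems
    real(8), intent(in) :: w(27, nitems), f(27, nitems)
    real(8) :: total(1)
    integer :: lane, item, offset
    real(8) :: p, old

    lane = threadIdx%x
    item = blockIdx%x
    p = 0.d0
    if (lane <= 27) p = w(lane,item) * f(lane,item)

    ! Butterfly reduction: after the loop every lane holds the full sum.
    ! All 32 lanes take part in every shuffle.
    offset = 16
    do while (offset >= 1)
      p = p + __shfl_xor(p, offset)
      offset = offset / 2
    end do

    ! Several warps may target the same address, so accumulate atomically.
    if (lane == 1) old = atomicadd(total(1), p)
  end subroutine warp_dot
end module warp_sketch

program warp_demo
  use cudafor
  use warp_sketch
  implicit none
  integer, parameter :: nitems = 100
  real(8), device :: w_d(27, nitems), f_d(27, nitems), total_d(1)
  real(8) :: total(1)
  integer :: istat

  w_d = 1.d0
  f_d = 2.d0
  total_d = 0.d0

  call warp_dot<<<nitems, 32>>>(w_d, f_d, total_d, nitems)
  istat = cudaDeviceSynchronize()
  total = total_d

  ! Each item contributes 27 * (1 * 2) = 54
  if (abs(total(1) - 54.d0*dble(nitems)) > 1.d-9) error stop 'warp reduction mismatch'
  print *, 'accumulated total = ', total(1)
end program warp_demo
```

Because the 27 values never leave the registers of a single warp, no shared memory or block-level synchronization is needed for the reduction itself.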

29 Benchmarking and Results

30 Verification Case

31 Benchmarking Case
- Unit cube, quiescent flow
- N = 128, 256, 384
- # of Particles = 1, 8, 27, 64
- Particle Resolution = 1280, 5120, triangles
- Run on:
  - 1x 16-core Intel(R) Xeon(R) CPU E GHz
  - 1x NVIDIA Tesla V100 PCIe

32 Grid Resolution
- Fixed # of Particles = 8
- Particle Resolution = 5120 triangles
- Fluid: 10 to 14x speedup vs. CPU
- IB + Structural: 40 to 100x speedup vs. CPU
- Percentage of time:
  - CPU: 72% to 14%
  - GPU: 20% to 6%

33 Particle Resolution
- Fixed N = 256
- Fixed # of Particles = 8
- IB + structural solver time increases at reduced rate on GPU:
  - CPU: 15% to 55%
  - GPU: 6% to 13%

34 Number of Particles
- Fixed N = 256
- Particle Resolution = 5120 triangles
- IB + Structural solver time increases at similar rates:
  - CPU: 14% to 59%
  - GPU: 5% to 22%

36 Conclusions

37 Conclusions
- Porting research codes to GPUs is worth the investment
  - Faster runtimes enable larger cases, more rapid experimentation
- Large performance gains can be achieved with low effort using CUDA Fortran
  - CUF kernel directives
  - CUDA-enabled libraries
  - Custom kernels when all else fails
- Working with developers to apply current code to challenging research cases
- Some previous work with these developers can be found on GitHub:
