Progress Porting WRF to GPU using OpenACC
1 Progress Porting WRF to GPU using OpenACC
Carl Ponder, Ph.D., HPC Applications Performance, NVIDIA Developer Technology
Alexey Romanenko, Alexey Snytnikov, NVIDIA OpenSource Applications
2 WRF Porting Plan
The WRF developers are interested in a GPU port if the performance is suitable. OpenACC extensions are acceptable because:
- OpenACC is an open standard, unlike CUDA Fortran
- The changes are minimal, as with OpenMP, unlike OpenCL
Changes to WRF modules will need to be negotiated with the developers in order to provide support. The tradeoff is the extent of the changes versus the performance gain.
3 Status So Far
OpenACC with minimal rewrite; changes updated to the latest WRF bugfix release.
Physics-model speedups, measured for 1 processor, 1 core versus 1 core + 1 GPU, in isolation from the rest of WRF:
- Thompson: 2x
- Morrison: 4.5x
- Kessler: 2x
Still working on the Dynamics to get an end-to-end speedup. We also need scaling to more cores sharing the GPU.
4 WRF Parallel Performance Issues
WRF has a flattish profile, so by Amdahl's Law, speeding up any particular loop doesn't help much. Loops tend NOT to re-process data, so moving arrays between host and GPU costs more than the parallel processing would save. An overall speedup requires speeding up many loops, and also requires keeping the data resident on the GPU for the bulk of the computation. HYCOM and other codes have a similar issue.
5 WRF Re-Coding Issues
- Large code, almost 1 million lines
- Slow compiles: 30 minutes on PGI, 8 hours on Cray Fortran with modules
- Subroutines with thousands of lines make manual analysis difficult
- Matching data update and present directives
- Matching data & kernel/parallel directives with compound statements
6 WRF Profile (VampirTrace), 1 Core (PSG/IvyBridge)
(The exclusive- and inclusive-time columns were lost in transcription; the call counts and routine names that survive are listed below. Total time ~3600s under VampirTrace.)

calls  name
       zolri2_
       psim_unstable_
       psih_unstable_
920    module_advect_em_advect_scalar_
40     module_mp_morr_two_moment_morr_two_moment_micro_
400    module_advect_em_advect_scalar_pd_
40     module_mp_morr_two_moment_mp_morr_two_moment_
280    module_small_step_em_advance_w_
       module_ra_rrtm_rtrn_
       zolri_
120    module_big_step_utilities_em_horizontal_pressure_gradient_gpu_
       psih_stable_
280    module_small_step_em_advance_uv_
       psim_stable_
       module_bl_ysu_ysu2d_
1200   module_em_rk_update_scalar_
5120   module_big_step_utilities_em_zero_tend_gpu_
       module_sf_sfclayrev_sfclayrev1d_
280    module_small_step_em_advance_mu_t_
120    module_big_step_utilities_em_calc_p_rho_phi_gpu_
40     module_pbl_driver_pbl_driver_
520    module_big_step_utilities_em_horizontal_diffusion_
400    module_em_rk_update_scalar_pd_
40     solve_em_
1200   module_em_rk_scalar_tend_
120    module_em_rk_tendency_
40     module_first_rk_step_part2_first_rk_step_part2_
7 OpenACC Speedups & Slowdowns
(Side-by-side VampirTrace profile, 1 Core (PSG/IvyBridge) versus 1 Core + 1 K40, over the same routine list as slide 6. Most timing columns were lost in transcription; the timings that survive, apparently from the GPU column, include: module_advect_em_advect_scalar_pd_ 3.742s, module_small_step_em_advance_w_ 7.896s, module_big_step_utilities_em_horizontal_pressure_gradient_gpu_ 5.213s, module_small_step_em_advance_uv_ 2.003s, module_em_rk_update_scalar_ 2.218s, module_big_step_utilities_em_zero_tend_gpu_ 1.313s, module_small_step_em_advance_mu_t_ 1.282s, module_big_step_utilities_em_calc_p_rho_phi_gpu_ 0.230s, module_big_step_utilities_em_horizontal_diffusion_ 1.399s. Total time ~3600s under VampirTrace.)
8 Cumulative Speedups
(Same side-by-side table as slide 7: 1 Core (PSG/IvyBridge) versus 1 Core + 1 K40, exclusive/inclusive times and call counts per routine. The numeric columns were lost in transcription. Total time ~3600s under VampirTrace.)
9 WRF GPU Profile (nvprof)
(The Time (s) column was lost in transcription; the entries, in order, were:)
cuStreamSynchronize, rk_update_scalar_pd_1493_gpu, morr_two_moment_micro_3412_gpu, [CUDA memcpy DtoH], [CUDA memcpy HtoD], cuEventSynchronize, advance_w_1553_gpu, morr_two_moment_micro_1365_gpu, mp_morr_two_moment_799_gpu, spec_bdytend_gpu_2241_gpu, advance_w_1628_gpu, cuLaunchKernel.
Total time ~1800s under nvprof.
10 WRF GPU Profile (nvprof)
(Same table as slide 9, with one entry called out: rk_update_scalar_pd_1493_gpu ---> <1 second.)
Total time ~1800s under nvprof.
11 Loop Speedups
So far, any loop we accelerate drops to near-zero time. We usually don't have to restructure the loops, just add OpenACC annotations. We will revisit this once we have a broader speedup.
12 GPU-coordination versus CPU-Cache Striding
This is the most efficient arrangement for CPU execution:

DO j = j_start, j_end
  DO k = kts, ktf
    DO i = i_start, i_end
      mrdx = msfux(i,j)*rdx   ! ADT eqn 44, 1st term on RHS
      tendency(i,k,j) = tendency(i,k,j) - mrdx*0.25 &
          *((ru(i+1,k,j)+ru(i,k,j))*(u(i+1,k,j)+u(i,k,j)) &
           -(ru(i,k,j)+ru(i-1,k,j))*(u(i,k,j)+u(i-1,k,j)))
    ENDDO
  ENDDO
ENDDO

Fortran arrays are stored in column-major order. Each inner iteration scans consecutive entries in the same cache line, and each outer iteration processes a sequence of cache lines, with no repetition. For OpenMP parallelism, you will tend to split at the outer loop and let the inner loops proceed as before.
13 GPU-coordination versus CPU-Cache Striding
[Diagram: with a k-stride, adjacent threads in the warp, indexed by (i,j), process adjacent elements of the k-th column; the figure shows the i-adjacency of the (i,j) threads across the j-th matrix block of (i,k,j).]
14 GPU-coordination versus CPU-Cache Striding
This is the most efficient arrangement for GPU execution:

!$acc parallel
!$acc loop collapse(2)
DO j = j_start, j_end
  DO i = i_start, i_end
    DO k = kts, ktf
      mrdx = msfux(i,j)*rdx   ! ADT eqn 44, 1st term on RHS
      tendency(i,k,j) = tendency(i,k,j) - mrdx*0.25 &
          *((ru(i+1,k,j)+ru(i,k,j))*(u(i+1,k,j)+u(i,k,j)) &
           -(ru(i,k,j)+ru(i-1,k,j))*(u(i,k,j)+u(i-1,k,j)))
    ENDDO
  ENDDO
ENDDO
!$acc end parallel

Iterations of the outer loops are mapped to GPU threads. Adjacent threads in a warp process consecutive elements in the same column, and the sequential inner loop processes consecutive columns with coordinated threads.
15 Loop/Conditional Interchange
Before: the conditional selects a stencil per j, with the k and i loops inside each branch.

j_loop_y_flux_6 : DO j = j_start, j_end
  IF ( (j >= j_start_f) .and. (j <= j_end_f) ) THEN   ! use full stencil
    DO k = kts, ktf
      DO i = i_start, i_end
        vel = rv(i,k,j)
        fqy( i, k, jp1 ) = vel*flux6( &
            field(i,k,j-3), field(i,k,j-2), field(i,k,j-1), &
            field(i,k,j  ), field(i,k,j+1), field(i,k,j+2), vel )
      ENDDO
    ENDDO
  ELSE IF ( j == jds+1 ) THEN   ! 2nd order flux next to south boundary
    ...
  END IF
ENDDO j_loop_y_flux_6

After: k and i move outward, the conditional moves inside the innermost j loop, and scalars carry the fluxes.

j_loop_y_flux_6 : DO k = kts, ktf
  DO i = i_start, i_end
    fqy_1 = ...
    fqy_2 = ...
    DO j = j_start, j_end
      IF ( (j >= j_start_f) .and. (j <= j_end_f) ) THEN   ! use full stencil
        vel = rv(i,k,j)
        fqy_1 = vel*flux6( &
            field(i,k,j-3), field(i,k,j-2), field(i,k,j-1), &
            field(i,k,j  ), field(i,k,j+1), field(i,k,j+2), vel )
      ELSE IF ( j == jds+1 ) THEN   ! 2nd order flux next to south boundary
        ...
      END IF
    ENDDO
  ENDDO
ENDDO j_loop_y_flux_6
16 Horror Loop (dyn_em/module_advect_em.f)

DO j = j_start, j_end
  DO k = kts, ktf
#ifdef XEON_SIMD
!DIR$ vector always
#endif
    DO i = i_start, i_end
      ph_low = (mub(i,j)+mu_old(i,j))*field_old(i,k,j) &
             - dt*( msftx(i,j)*msfty(i,j)*( &
                      rdx*(fqxl(i+1,k,j)-fqxl(i,k,j)) + &
                      rdy*(fqyl(i,k,j+1)-fqyl(i,k,j)) ) &
                  + msfty(i,j)*rdzw(k)*(fqzl(i,k+1,j)-fqzl(i,k,j)) )
      flux_out = dt*( (msftx(i,j)*msfty(i,j))*( &
                        rdx*( max(0.,fqx(i+1,k,j)) &
                            - min(0.,fqx(i,k,j)) ) + &
                        rdy*( max(0.,fqy(i,k,j+1)) &
                            - min(0.,fqy(i,k,j)) ) ) &
                    + msfty(i,j)*rdzw(k)*( min(0.,fqz(i,k+1,j)) &
                                         - max(0.,fqz(i,k,j)) ) )
      IF ( flux_out .gt. ph_low ) THEN
        scale = max(0.,ph_low/(flux_out+eps))
        IF ( fqx(i+1,k,j) .gt. 0. ) fqx(i+1,k,j) = scale*fqx(i+1,k,j)
        IF ( fqx(i,k,j)   .lt. 0. ) fqx(i,k,j)   = scale*fqx(i,k,j)
        IF ( fqy(i,k,j+1) .gt. 0. ) fqy(i,k,j+1) = scale*fqy(i,k,j+1)
        IF ( fqy(i,k,j)   .lt. 0. ) fqy(i,k,j)   = scale*fqy(i,k,j)
        ! note: z flux is opposite sign in mass coordinate because
        ! vertical coordinate decreases with increasing k
        IF ( fqz(i,k+1,j) .lt. 0. ) fqz(i,k+1,j) = scale*fqz(i,k+1,j)
        IF ( fqz(i,k,j)   .gt. 0. ) fqz(i,k,j)   = scale*fqz(i,k,j)
      END IF
    ENDDO
  ENDDO
ENDDO
17 Horror Loop (dyn_em/module_advect_em.f)

!$acc kernels
!$acc loop independent collapse(3) private(ph_low,flux_out)
DO j = j_start, j_end
  DO k = kts, ktf
#ifdef XEON_SIMD
!DIR$ vector always
#endif
    DO i = i_start, i_end
      ph_low = (mub(i,j)+mu_old(i,j))*field_old(i,k,j) &
             - dt*( msftx(i,j)*msfty(i,j)*( &
                      rdx*(fqxl(i+1,k,j)-fqxl(i,k,j)) + &
                      rdy*(fqyl(i,k,j+1)-fqyl(i,k,j)) ) &
                  + msfty(i,j)*rdzw(k)*(fqzl(i,k+1,j)-fqzl(i,k,j)) )
      flux_out = dt*( (msftx(i,j)*msfty(i,j))*( &
                        rdx*( max(0.,fqx(i+1,k,j)) &
                            - min(0.,fqx(i,k,j)) ) + &
                        rdy*( max(0.,fqy(i,k,j+1)) &
                            - min(0.,fqy(i,k,j)) ) ) &
                    + msfty(i,j)*rdzw(k)*( min(0.,fqz(i,k+1,j)) &
                                         - max(0.,fqz(i,k,j)) ) )
      IF ( flux_out .gt. ph_low ) THEN
        scale = max(0.,ph_low/(flux_out+eps))
        IF ( fqx(i+1,k,j) .gt. 0. ) fqx(i+1,k,j) = scale*fqx(i+1,k,j)
        IF ( fqx(i,k,j)   .lt. 0. ) fqx(i,k,j)   = scale*fqx(i,k,j)
        IF ( fqy(i,k,j+1) .gt. 0. ) fqy(i,k,j+1) = scale*fqy(i,k,j+1)
        IF ( fqy(i,k,j)   .lt. 0. ) fqy(i,k,j)   = scale*fqy(i,k,j)
        ! note: z flux is opposite sign in mass coordinate because
        ! vertical coordinate decreases with increasing k
        IF ( fqz(i,k+1,j) .lt. 0. ) fqz(i,k+1,j) = scale*fqz(i,k+1,j)
        IF ( fqz(i,k,j)   .gt. 0. ) fqz(i,k,j)   = scale*fqz(i,k,j)
      END IF
    ENDDO
  ENDDO
ENDDO
!$acc end kernels
18 Mirroring the Record-Structure

TYPE( domain ), POINTER :: grid
real, dimension(grid%sm31:grid%em31, grid%sm33:grid%em33) :: msfux
ALLOCATE( grid%msfux(sm31:em31, sm33:em33), STAT=ierr )

real, dimension(:,:), pointer :: msfux_pt
CALL alloc_gpu_extra2d( grid%msfux, msfux_pt )

SUBROUTINE alloc_gpu_extra1d( grid_pt, pt )
  real, dimension(:), pointer :: pt
  real, dimension(:), pointer :: grid_pt
!$acc enter data create(grid_pt)
  pt => grid_pt
END SUBROUTINE alloc_gpu_extra1d
19 OpenACC Original Intent
Well-bracketed regions: position data on the GPU, process it, save the result, and discard the rest.

!$acc data create(...) &
!$acc      copy(...)
  call subroutine
!$acc end data

We are finding instead that we're swapping arrays between host and GPU memory at random points.
20 Partial Port GPU Data Placement
We need to reduce the amount of data movement between the host and the GPU. This requires porting more code, so that it can operate on GPU-resident data. Explicit swaps have to be used when moving from GPU code to host code:

!$acc update device(...)
!$acc update host(...)
21 GPU Data Placement (WRF)

wrf_run -> integrate -> solve_interface -> solve_em   (DATA PLACEMENT HERE)
  -> microphysics_driver -> mp_gt_driver
  -> first_rk_step_part1
  -> first_rk_step_part2
  -> rk_scalar_tend -> advect_scalar_pd
22 Mixing OpenACC & CPU Code
In some cases it may be clearer to interleave the two in the same file:

#ifdef _OPENACC
! GPU-optimized loops
#else
! CPU-optimized loops
#endif

In other cases it may be clearer to duplicate the file and use the build process to select between them:
module_advect_em.f
module_advect_em.openacc.f
You can view the changes side-by-side with sdiff.
More informationPhysical parametrizations and OpenACC directives in COSMO
Physical parametrizations and OpenACC directives in COSMO Xavier Lapillonne Eidgenössisches Departement des Innern EDI Bundesamt für Meteorologie und Klimatologie MeteoSchweiz Name (change on Master slide)
More informationProgress on GPU Parallelization of the NIM Prototype Numerical Weather Prediction Dynamical Core
Progress on GPU Parallelization of the NIM Prototype Numerical Weather Prediction Dynamical Core Tom Henderson NOAA/OAR/ESRL/GSD/ACE Thomas.B.Henderson@noaa.gov Mark Govett, Jacques Middlecoff Paul Madden,
More informationThe challenges of new, efficient computer architectures, and how they can be met with a scalable software development strategy.! Thomas C.
The challenges of new, efficient computer architectures, and how they can be met with a scalable software development strategy! Thomas C. Schulthess ENES HPC Workshop, Hamburg, March 17, 2014 T. Schulthess!1
More informationGPU Computing with OpenACC Directives
GPU Computing with OpenACC Directives Alexey Romanenko Based on Jeff Larkin s PPTs 3 Ways to Accelerate Applications Applications Libraries OpenACC Directives Programming Languages Drop-in Acceleration
More informationAMath 483/583, Lecture 24, May 20, Notes: Notes: What s a GPU? Notes: Some GPU application areas
AMath 483/583 Lecture 24 May 20, 2011 Today: The Graphical Processing Unit (GPU) GPU Programming Today s lecture developed and presented by Grady Lemoine References: Andreas Kloeckner s High Performance
More informationTrellis: Portability Across Architectures with a High-level Framework
: Portability Across Architectures with a High-level Framework Lukasz G. Szafaryn + Todd Gamblin ++ Bronis R. de Supinski ++ Kevin Skadron + + University of Virginia {lgs9a, skadron@virginia.edu ++ Lawrence
More informationCompiling a High-level Directive-Based Programming Model for GPGPUs
Compiling a High-level Directive-Based Programming Model for GPGPUs Xiaonan Tian, Rengan Xu, Yonghong Yan, Zhifeng Yun, Sunita Chandrasekaran, and Barbara Chapman Department of Computer Science, University
More informationPortable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.
Portable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 What is Cray Libsci_acc? Provide basic scientific
More informationHeidi Poxon Cray Inc.
Heidi Poxon Topics GPU support in the Cray performance tools CUDA proxy MPI support for GPUs (GPU-to-GPU) 2 3 Programming Models Supported for the GPU Goal is to provide whole program analysis for programs
More informationA Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA
A Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA L. Oteski, G. Colin de Verdière, S. Contassot-Vivier, S. Vialle, J. Ryan Acks.: CEA/DIFF, IDRIS, GENCI, NVIDIA, Région
More informationParallelization of the NIM Dynamical Core for GPUs
Xbox 360 CPU Parallelization of the NIM Dynamical Core for GPUs Mark Govett Jacques Middlecoff, Tom Henderson, JimRosinski, Craig Tierney Bigger Systems More oeexpensive e Facilities Bigger Power Bills
More informationLecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators
Lecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators CSCE 569 Parallel Computing Department of Computer Science and Engineering Yonghong Yan yanyh@cse.sc.edu
More informationPortable and Productive Performance with OpenACC Compilers and Tools. Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.
Portable and Productive Performance with OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 Cray: Leadership in Computational Research Earth Sciences
More informationCLAW FORTRAN Compiler source-to-source translation for performance portability
CLAW FORTRAN Compiler source-to-source translation for performance portability XcalableMP Workshop, Akihabara, Tokyo, Japan October 31, 2017 Valentin Clement valentin.clement@env.ethz.ch Image: NASA Summary
More informationCONTINUED EFFORTS IN ADAPTING THE GEOS- 5 AGCM TO ACCELERATORS: SUCCESSES AND CHALLENGES
CONTINUED EFFORTS IN ADAPTING THE GEOS- 5 AGCM TO ACCELERATORS: SUCCESSES AND CHALLENGES 9/20/2013 Matt Thompson matthew.thompson@nasa.gov Accelerator Conversion Aims Code rewrites will probably be necessary,
More informationChapter 3 Parallel Software
Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers
More informationOPENACC: ACCELERATE KIRCHHOFF 2D MIGRATION
April 4-7, 2016 Silicon Valley OPENACC: ACCELERATE KIRCHHOFF 2D MIGRATION Ken Hester: NVIDIA Solution Architect Oil &Gas EXPLORATION & PRODUCTION WORKFLOW Acquire Seismic Data Process Seismic Data Interpret
More informationParallelism III. MPI, Vectorization, OpenACC, OpenCL. John Cavazos,Tristan Vanderbruggen, and Will Killian
Parallelism III MPI, Vectorization, OpenACC, OpenCL John Cavazos,Tristan Vanderbruggen, and Will Killian Dept of Computer & Information Sciences University of Delaware 1 Lecture Overview Introduction MPI
More informationOpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer
OpenACC Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer 3 WAYS TO ACCELERATE APPLICATIONS Applications Libraries Compiler Directives Programming Languages Easy to use Most Performance
More informationObjective. GPU Teaching Kit. OpenACC. To understand the OpenACC programming model. Introduction to OpenACC
GPU Teaching Kit Accelerated Computing OpenACC Introduction to OpenACC Objective To understand the OpenACC programming model basic concepts and pragma types simple examples 2 2 OpenACC The OpenACC Application
More informationPerformance Portability and OpenACC
Performance Portability and OpenACC Douglas Miles, David Norton and Michael Wolfe PGI / NVIDIA ABSTRACT: Performance portability means a single program gives good performance across a variety of systems,
More informationExperiences with Achieving Portability across Heterogeneous Architectures
Experiences with Achieving Portability across Heterogeneous Architectures Lukasz G. Szafaryn +, Todd Gamblin ++, Bronis R. de Supinski ++ and Kevin Skadron + + University of Virginia ++ Lawrence Livermore
More informationProgramming NVIDIA GPUs with OpenACC Directives
Programming NVIDIA GPUs with OpenACC Directives Michael Wolfe michael.wolfe@pgroup.com http://www.pgroup.com/accelerate Programming NVIDIA GPUs with OpenACC Directives Michael Wolfe mwolfe@nvidia.com http://www.pgroup.com/accelerate
More informationKernelGen a toolchain for automatic GPU-centric applications porting. Nicolas Lihogrud Dmitry Mikushin Andrew Adinets
P A R A L L E L C O M P U T A T I O N A L T E C H N O L O G I E S ' 2 0 1 2 KernelGen a toolchain for automatic GPU-centric applications porting Nicolas Lihogrud Dmitry Mikushin Andrew Adinets Contents
More informationCPU GPU. Regional Models. Global Models. Bigger Systems More Expensive Facili:es Bigger Power Bills Lower System Reliability
Xbox 360 Successes and Challenges using GPUs for Weather and Climate Models DOE Jaguar Mark GoveM Jacques Middlecoff, Tom Henderson, Jim Rosinski, Craig Tierney CPU Bigger Systems More Expensive Facili:es
More informationProfiling & Tuning Applications. CUDA Course István Reguly
Profiling & Tuning Applications CUDA Course István Reguly Introduction Why is my application running slow? Work it out on paper Instrument code Profile it NVIDIA Visual Profiler Works with CUDA, needs
More information