Progress Porting WRF to GPU using OpenACC
1 Progress Porting WRF to GPU using OpenACC
Carl Ponder, Ph.D., HPC Applications Performance, NVIDIA Developer Technology
Alexey Romanenko, Alexey Snytnikov, NVIDIA OpenSource Applications
2 WRF Porting Plan
The WRF developers are interested in a GPU port if the performance is suitable. OpenACC extensions are acceptable because:
- OpenACC is an open standard, unlike CUDA Fortran
- The changes are minimal, as with OpenMP, unlike OpenCL
Changes to WRF modules will need to be negotiated with the developers in order to provide support. The tradeoff is the extent of the changes versus the performance gain.
3 Status So Far
OpenACC with minimal rewrite; changes updated to the latest WRF bugfix release.
Physics-model speedups, measured for 1 processor, 1 core versus 1 core + 1 GPU, in isolation from the rest of WRF:
- Thompson: 2x
- Morrison: 4.5x
- Kessler: 2x
Still working on the Dynamics to get an end-to-end speedup. We also need scaling to more cores sharing the GPU.
4 WRF Parallel Performance Issues
WRF has a flattish profile, so by Amdahl's Law, speeding up any particular loop doesn't help much. Loops tend NOT to re-process data, so moving arrays between host and GPU costs more than the parallel processing would save. An overall speedup requires speeding up many loops, and also requires keeping the data resident on the GPU for the bulk of the computation. HYCOM and other codes have a similar issue.
5 WRF Re-Coding Issues
- Large code, almost 1 million lines
- Slow compiles: 30 minutes on PGI, 8 hours on Cray Fortran with modules
- Subroutines with thousands of lines make manual analysis difficult
- Matching data update and present directives
- Matching data & kernel/parallel directives with compound statements
6 WRF Profile (VampirTrace), 1 Core (PSG/IvyBridge)
(The exclusive- and inclusive-time columns were lost in transcription; the call counts and routine names that survive are listed below. Total time ~3600s under VampirTrace.)

calls  name
       zolri2_
       psim_unstable_
       psih_unstable_
920    module_advect_em_advect_scalar_
40     module_mp_morr_two_moment_morr_two_moment_micro_
400    module_advect_em_advect_scalar_pd_
40     module_mp_morr_two_moment_mp_morr_two_moment_
280    module_small_step_em_advance_w_
       module_ra_rrtm_rtrn_
       zolri_
120    module_big_step_utilities_em_horizontal_pressure_gradient_gpu_
       psih_stable_
280    module_small_step_em_advance_uv_
       psim_stable_
       module_bl_ysu_ysu2d_
1200   module_em_rk_update_scalar_
5120   module_big_step_utilities_em_zero_tend_gpu_
       module_sf_sfclayrev_sfclayrev1d_
280    module_small_step_em_advance_mu_t_
120    module_big_step_utilities_em_calc_p_rho_phi_gpu_
40     module_pbl_driver_pbl_driver_
520    module_big_step_utilities_em_horizontal_diffusion_
400    module_em_rk_update_scalar_pd_
40     solve_em_
1200   module_em_rk_scalar_tend_
120    module_em_rk_tendency_
40     module_first_rk_step_part2_first_rk_step_part2_
7 OpenACC Speedups & Slowdowns
(Side-by-side VampirTrace profile, 1 Core (PSG/IvyBridge) versus 1 Core + 1 K40, over the same routine list as slide 6. Most timing columns were lost in transcription; the timings that survive, apparently from the GPU column, include: module_advect_em_advect_scalar_pd_ 3.742s, module_small_step_em_advance_w_ 7.896s, module_big_step_utilities_em_horizontal_pressure_gradient_gpu_ 5.213s, module_small_step_em_advance_uv_ 2.003s, module_em_rk_update_scalar_ 2.218s, module_big_step_utilities_em_zero_tend_gpu_ 1.313s, module_small_step_em_advance_mu_t_ 1.282s, module_big_step_utilities_em_calc_p_rho_phi_gpu_ 0.230s, module_big_step_utilities_em_horizontal_diffusion_ 1.399s. Total time ~3600s under VampirTrace.)
8 Cumulative Speedups
(Same side-by-side table as slide 7: 1 Core (PSG/IvyBridge) versus 1 Core + 1 K40, exclusive/inclusive times and call counts per routine. The numeric columns were lost in transcription. Total time ~3600s under VampirTrace.)
9 WRF GPU Profile (nvprof)
(The Time (s) column was lost in transcription; the entries, in order, were:)
cuStreamSynchronize, rk_update_scalar_pd_1493_gpu, morr_two_moment_micro_3412_gpu, [CUDA memcpy DtoH], [CUDA memcpy HtoD], cuEventSynchronize, advance_w_1553_gpu, morr_two_moment_micro_1365_gpu, mp_morr_two_moment_799_gpu, spec_bdytend_gpu_2241_gpu, advance_w_1628_gpu, cuLaunchKernel.
Total time ~1800s under nvprof.
10 WRF GPU Profile (nvprof)
(Same table as slide 9, with one entry called out: rk_update_scalar_pd_1493_gpu ---> <1 second.)
Total time ~1800s under nvprof.
11 Loop Speedups
So far, any loop we accelerate drops to near-zero time. We usually don't have to restructure the loops, just add OpenACC annotations. We will revisit this once we have a broader speedup.
12 GPU-coordination versus CPU-Cache Striding
This is the most efficient arrangement for CPU execution:

DO j = j_start, j_end
  DO k = kts, ktf
    DO i = i_start, i_end
      mrdx = msfux(i,j)*rdx   ! ADT eqn 44, 1st term on RHS
      tendency(i,k,j) = tendency(i,k,j) - mrdx*0.25 &
          *((ru(i+1,k,j)+ru(i,k,j))*(u(i+1,k,j)+u(i,k,j)) &
           -(ru(i,k,j)+ru(i-1,k,j))*(u(i,k,j)+u(i-1,k,j)))
    ENDDO
  ENDDO
ENDDO

Fortran arrays are stored in column-major order. Each inner iteration scans consecutive entries in the same cache line, and each outer iteration processes a sequence of cache lines, with no repetition. For OpenMP parallelism, you will tend to split at the outer loop and let the inner loops proceed as before.
13 GPU-coordination versus CPU-Cache Striding
[Diagram: with a k-stride, adjacent threads in the warp, indexed by (i,j), process adjacent elements of the k-th column; the figure shows the i-adjacency of the (i,j) threads across the j-th matrix block of (i,k,j).]
14 GPU-coordination versus CPU-Cache Striding
This is the most efficient arrangement for GPU execution:

!$acc parallel
!$acc loop collapse(2)
DO j = j_start, j_end
  DO i = i_start, i_end
    DO k = kts, ktf
      mrdx = msfux(i,j)*rdx   ! ADT eqn 44, 1st term on RHS
      tendency(i,k,j) = tendency(i,k,j) - mrdx*0.25 &
          *((ru(i+1,k,j)+ru(i,k,j))*(u(i+1,k,j)+u(i,k,j)) &
           -(ru(i,k,j)+ru(i-1,k,j))*(u(i,k,j)+u(i-1,k,j)))
    ENDDO
  ENDDO
ENDDO
!$acc end parallel

Iterations of the outer loops are mapped to GPU threads. Adjacent threads in a warp process consecutive elements in the same column, and the sequential inner loop processes consecutive columns with coordinated threads.
15 Loop/Conditional Interchange
Before: the conditional selects a stencil per j, with the k and i loops inside each branch.

j_loop_y_flux_6 : DO j = j_start, j_end
  IF ( (j >= j_start_f) .and. (j <= j_end_f) ) THEN   ! use full stencil
    DO k = kts, ktf
      DO i = i_start, i_end
        vel = rv(i,k,j)
        fqy( i, k, jp1 ) = vel*flux6( &
            field(i,k,j-3), field(i,k,j-2), field(i,k,j-1), &
            field(i,k,j  ), field(i,k,j+1), field(i,k,j+2), vel )
      ENDDO
    ENDDO
  ELSE IF ( j == jds+1 ) THEN   ! 2nd order flux next to south boundary
    ...
  END IF
ENDDO j_loop_y_flux_6

After: k and i move outward, the conditional moves inside the innermost j loop, and scalars carry the fluxes.

j_loop_y_flux_6 : DO k = kts, ktf
  DO i = i_start, i_end
    fqy_1 = ...
    fqy_2 = ...
    DO j = j_start, j_end
      IF ( (j >= j_start_f) .and. (j <= j_end_f) ) THEN   ! use full stencil
        vel = rv(i,k,j)
        fqy_1 = vel*flux6( &
            field(i,k,j-3), field(i,k,j-2), field(i,k,j-1), &
            field(i,k,j  ), field(i,k,j+1), field(i,k,j+2), vel )
      ELSE IF ( j == jds+1 ) THEN   ! 2nd order flux next to south boundary
        ...
      END IF
    ENDDO
  ENDDO
ENDDO j_loop_y_flux_6
16 Horror Loop (dyn_em/module_advect_em.f)

DO j = j_start, j_end
  DO k = kts, ktf
#ifdef XEON_SIMD
!DIR$ vector always
#endif
    DO i = i_start, i_end
      ph_low = (mub(i,j)+mu_old(i,j))*field_old(i,k,j) &
             - dt*( msftx(i,j)*msfty(i,j)*( &
                      rdx*(fqxl(i+1,k,j)-fqxl(i,k,j)) + &
                      rdy*(fqyl(i,k,j+1)-fqyl(i,k,j)) ) &
                  + msfty(i,j)*rdzw(k)*(fqzl(i,k+1,j)-fqzl(i,k,j)) )
      flux_out = dt*( (msftx(i,j)*msfty(i,j))*( &
                        rdx*( max(0.,fqx(i+1,k,j)) &
                            - min(0.,fqx(i,k,j)) ) + &
                        rdy*( max(0.,fqy(i,k,j+1)) &
                            - min(0.,fqy(i,k,j)) ) ) &
                    + msfty(i,j)*rdzw(k)*( min(0.,fqz(i,k+1,j)) &
                                         - max(0.,fqz(i,k,j)) ) )
      IF ( flux_out .gt. ph_low ) THEN
        scale = max(0.,ph_low/(flux_out+eps))
        IF ( fqx(i+1,k,j) .gt. 0. ) fqx(i+1,k,j) = scale*fqx(i+1,k,j)
        IF ( fqx(i,k,j)   .lt. 0. ) fqx(i,k,j)   = scale*fqx(i,k,j)
        IF ( fqy(i,k,j+1) .gt. 0. ) fqy(i,k,j+1) = scale*fqy(i,k,j+1)
        IF ( fqy(i,k,j)   .lt. 0. ) fqy(i,k,j)   = scale*fqy(i,k,j)
        ! note: z flux is opposite sign in mass coordinate because
        ! vertical coordinate decreases with increasing k
        IF ( fqz(i,k+1,j) .lt. 0. ) fqz(i,k+1,j) = scale*fqz(i,k+1,j)
        IF ( fqz(i,k,j)   .gt. 0. ) fqz(i,k,j)   = scale*fqz(i,k,j)
      END IF
    ENDDO
  ENDDO
ENDDO
17 Horror Loop (dyn_em/module_advect_em.f)

!$acc kernels
!$acc loop independent collapse(3) private(ph_low,flux_out)
DO j = j_start, j_end
  DO k = kts, ktf
#ifdef XEON_SIMD
!DIR$ vector always
#endif
    DO i = i_start, i_end
      ph_low = (mub(i,j)+mu_old(i,j))*field_old(i,k,j) &
             - dt*( msftx(i,j)*msfty(i,j)*( &
                      rdx*(fqxl(i+1,k,j)-fqxl(i,k,j)) + &
                      rdy*(fqyl(i,k,j+1)-fqyl(i,k,j)) ) &
                  + msfty(i,j)*rdzw(k)*(fqzl(i,k+1,j)-fqzl(i,k,j)) )
      flux_out = dt*( (msftx(i,j)*msfty(i,j))*( &
                        rdx*( max(0.,fqx(i+1,k,j)) &
                            - min(0.,fqx(i,k,j)) ) + &
                        rdy*( max(0.,fqy(i,k,j+1)) &
                            - min(0.,fqy(i,k,j)) ) ) &
                    + msfty(i,j)*rdzw(k)*( min(0.,fqz(i,k+1,j)) &
                                         - max(0.,fqz(i,k,j)) ) )
      IF ( flux_out .gt. ph_low ) THEN
        scale = max(0.,ph_low/(flux_out+eps))
        IF ( fqx(i+1,k,j) .gt. 0. ) fqx(i+1,k,j) = scale*fqx(i+1,k,j)
        IF ( fqx(i,k,j)   .lt. 0. ) fqx(i,k,j)   = scale*fqx(i,k,j)
        IF ( fqy(i,k,j+1) .gt. 0. ) fqy(i,k,j+1) = scale*fqy(i,k,j+1)
        IF ( fqy(i,k,j)   .lt. 0. ) fqy(i,k,j)   = scale*fqy(i,k,j)
        ! note: z flux is opposite sign in mass coordinate because
        ! vertical coordinate decreases with increasing k
        IF ( fqz(i,k+1,j) .lt. 0. ) fqz(i,k+1,j) = scale*fqz(i,k+1,j)
        IF ( fqz(i,k,j)   .gt. 0. ) fqz(i,k,j)   = scale*fqz(i,k,j)
      END IF
    ENDDO
  ENDDO
ENDDO
!$acc end kernels
18 Mirroring the Record-Structure

TYPE( domain ), POINTER :: grid
real, dimension(grid%sm31:grid%em31, grid%sm33:grid%em33) :: msfux
ALLOCATE( grid%msfux(sm31:em31, sm33:em33), STAT=ierr )

real, dimension(:,:), pointer :: msfux_pt
CALL alloc_gpu_extra2d( grid%msfux, msfux_pt )

SUBROUTINE alloc_gpu_extra1d( grid_pt, pt )
  real, dimension(:), pointer :: pt
  real, dimension(:), pointer :: grid_pt
!$acc enter data create(grid_pt)
  pt => grid_pt
END SUBROUTINE alloc_gpu_extra1d
19 OpenACC Original Intent
Well-bracketed regions: position data on the GPU, process it, save the result, and discard the rest.

!$acc data create(...) &
!$acc      copy(...)
  call subroutine
!$acc end data

We are finding instead that we're swapping arrays between host and GPU memory at random points.
20 Partial Port GPU Data Placement
We need to reduce the amount of data movement between the host and the GPU. This requires porting more code, so that it can operate on GPU-resident data. Explicit swaps have to be used when moving from GPU code to host code:

!$acc update device(...)
!$acc update host(...)
21 GPU Data Placement (WRF)

wrf_run -> integrate -> solve_interface -> solve_em   (DATA PLACEMENT HERE)
  -> microphysics_driver -> mp_gt_driver
  -> first_rk_step_part1
  -> first_rk_step_part2
  -> rk_scalar_tend -> advect_scalar_pd
22 Mixing OpenACC & CPU Code
In some cases it may be clearer to interleave the two in the same file:

#ifdef _OPENACC
! GPU-optimized loops
#else
! CPU-optimized loops
#endif

In other cases it may be clearer to duplicate the file and use the build process to select between them:
module_advect_em.f
module_advect_em.openacc.f
You can view the changes side-by-side with sdiff.
More informationPhysical parametrizations and OpenACC directives in COSMO
Physical parametrizations and OpenACC directives in COSMO Xavier Lapillonne Eidgenössisches Departement des Innern EDI Bundesamt für Meteorologie und Klimatologie MeteoSchweiz Name (change on Master slide)
More informationProgress on GPU Parallelization of the NIM Prototype Numerical Weather Prediction Dynamical Core
Progress on GPU Parallelization of the NIM Prototype Numerical Weather Prediction Dynamical Core Tom Henderson NOAA/OAR/ESRL/GSD/ACE Thomas.B.Henderson@noaa.gov Mark Govett, Jacques Middlecoff Paul Madden,
More informationThe challenges of new, efficient computer architectures, and how they can be met with a scalable software development strategy.! Thomas C.
The challenges of new, efficient computer architectures, and how they can be met with a scalable software development strategy! Thomas C. Schulthess ENES HPC Workshop, Hamburg, March 17, 2014 T. Schulthess!1
More informationGPU Computing with OpenACC Directives
GPU Computing with OpenACC Directives Alexey Romanenko Based on Jeff Larkin s PPTs 3 Ways to Accelerate Applications Applications Libraries OpenACC Directives Programming Languages Drop-in Acceleration
More informationAMath 483/583, Lecture 24, May 20, Notes: Notes: What s a GPU? Notes: Some GPU application areas
AMath 483/583 Lecture 24 May 20, 2011 Today: The Graphical Processing Unit (GPU) GPU Programming Today s lecture developed and presented by Grady Lemoine References: Andreas Kloeckner s High Performance
More informationTrellis: Portability Across Architectures with a High-level Framework
: Portability Across Architectures with a High-level Framework Lukasz G. Szafaryn + Todd Gamblin ++ Bronis R. de Supinski ++ Kevin Skadron + + University of Virginia {lgs9a, skadron@virginia.edu ++ Lawrence
More informationCompiling a High-level Directive-Based Programming Model for GPGPUs
Compiling a High-level Directive-Based Programming Model for GPGPUs Xiaonan Tian, Rengan Xu, Yonghong Yan, Zhifeng Yun, Sunita Chandrasekaran, and Barbara Chapman Department of Computer Science, University
More informationPortable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.
Portable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 What is Cray Libsci_acc? Provide basic scientific
More informationHeidi Poxon Cray Inc.
Heidi Poxon Topics GPU support in the Cray performance tools CUDA proxy MPI support for GPUs (GPU-to-GPU) 2 3 Programming Models Supported for the GPU Goal is to provide whole program analysis for programs
More informationA Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA
A Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA L. Oteski, G. Colin de Verdière, S. Contassot-Vivier, S. Vialle, J. Ryan Acks.: CEA/DIFF, IDRIS, GENCI, NVIDIA, Région
More informationParallelization of the NIM Dynamical Core for GPUs
Xbox 360 CPU Parallelization of the NIM Dynamical Core for GPUs Mark Govett Jacques Middlecoff, Tom Henderson, JimRosinski, Craig Tierney Bigger Systems More oeexpensive e Facilities Bigger Power Bills
More informationLecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators
Lecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators CSCE 569 Parallel Computing Department of Computer Science and Engineering Yonghong Yan yanyh@cse.sc.edu
More informationPortable and Productive Performance with OpenACC Compilers and Tools. Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.
Portable and Productive Performance with OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 Cray: Leadership in Computational Research Earth Sciences
More informationCLAW FORTRAN Compiler source-to-source translation for performance portability
CLAW FORTRAN Compiler source-to-source translation for performance portability XcalableMP Workshop, Akihabara, Tokyo, Japan October 31, 2017 Valentin Clement valentin.clement@env.ethz.ch Image: NASA Summary
More informationCONTINUED EFFORTS IN ADAPTING THE GEOS- 5 AGCM TO ACCELERATORS: SUCCESSES AND CHALLENGES
CONTINUED EFFORTS IN ADAPTING THE GEOS- 5 AGCM TO ACCELERATORS: SUCCESSES AND CHALLENGES 9/20/2013 Matt Thompson matthew.thompson@nasa.gov Accelerator Conversion Aims Code rewrites will probably be necessary,
More informationChapter 3 Parallel Software
Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers
More informationOPENACC: ACCELERATE KIRCHHOFF 2D MIGRATION
April 4-7, 2016 Silicon Valley OPENACC: ACCELERATE KIRCHHOFF 2D MIGRATION Ken Hester: NVIDIA Solution Architect Oil &Gas EXPLORATION & PRODUCTION WORKFLOW Acquire Seismic Data Process Seismic Data Interpret
More informationParallelism III. MPI, Vectorization, OpenACC, OpenCL. John Cavazos,Tristan Vanderbruggen, and Will Killian
Parallelism III MPI, Vectorization, OpenACC, OpenCL John Cavazos,Tristan Vanderbruggen, and Will Killian Dept of Computer & Information Sciences University of Delaware 1 Lecture Overview Introduction MPI
More informationOpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer
OpenACC Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer 3 WAYS TO ACCELERATE APPLICATIONS Applications Libraries Compiler Directives Programming Languages Easy to use Most Performance
More informationObjective. GPU Teaching Kit. OpenACC. To understand the OpenACC programming model. Introduction to OpenACC
GPU Teaching Kit Accelerated Computing OpenACC Introduction to OpenACC Objective To understand the OpenACC programming model basic concepts and pragma types simple examples 2 2 OpenACC The OpenACC Application
More informationPerformance Portability and OpenACC
Performance Portability and OpenACC Douglas Miles, David Norton and Michael Wolfe PGI / NVIDIA ABSTRACT: Performance portability means a single program gives good performance across a variety of systems,
More informationExperiences with Achieving Portability across Heterogeneous Architectures
Experiences with Achieving Portability across Heterogeneous Architectures Lukasz G. Szafaryn +, Todd Gamblin ++, Bronis R. de Supinski ++ and Kevin Skadron + + University of Virginia ++ Lawrence Livermore
More informationProgramming NVIDIA GPUs with OpenACC Directives
Programming NVIDIA GPUs with OpenACC Directives Michael Wolfe michael.wolfe@pgroup.com http://www.pgroup.com/accelerate Programming NVIDIA GPUs with OpenACC Directives Michael Wolfe mwolfe@nvidia.com http://www.pgroup.com/accelerate
More informationKernelGen a toolchain for automatic GPU-centric applications porting. Nicolas Lihogrud Dmitry Mikushin Andrew Adinets
P A R A L L E L C O M P U T A T I O N A L T E C H N O L O G I E S ' 2 0 1 2 KernelGen a toolchain for automatic GPU-centric applications porting Nicolas Lihogrud Dmitry Mikushin Andrew Adinets Contents
More informationCPU GPU. Regional Models. Global Models. Bigger Systems More Expensive Facili:es Bigger Power Bills Lower System Reliability
Xbox 360 Successes and Challenges using GPUs for Weather and Climate Models DOE Jaguar Mark GoveM Jacques Middlecoff, Tom Henderson, Jim Rosinski, Craig Tierney CPU Bigger Systems More Expensive Facili:es
More informationProfiling & Tuning Applications. CUDA Course István Reguly
Profiling & Tuning Applications CUDA Course István Reguly Introduction Why is my application running slow? Work it out on paper Instrument code Profile it NVIDIA Visual Profiler Works with CUDA, needs
More information