Progress Porting WRF to GPU using OpenACC


1 Progress Porting WRF to GPU using OpenACC
Carl Ponder, Ph.D., HPC Applications Performance, NVIDIA Developer Technology
Alexey Romanenko and Alexey Snytnikov, NVIDIA OpenSource Applications

2 WRF Porting Plan
The WRF developers are interested in a GPU port if the performance is suitable. OpenACC extensions are acceptable because:
- OpenACC is an open standard, unlike CUDA Fortran
- The changes are minimal, like OpenMP, unlike OpenCL
Changes to WRF modules will need to be negotiated with the developers in order to provide support. The tradeoff is the extent of the changes versus the performance gain.

3 Status So Far
OpenACC with minimal rewriting; changes updated to the latest WRF bugfix release.
Physics-model speedups, 1 core + 1 GPU versus 1 proc/1 core, measured in isolation from the rest of WRF:
- Thompson: 2x
- Morrison: 4.5x
- Kessler: 2x
Still working on the Dynamics to get an end-to-end speedup; we also need scaling to more cores sharing the GPU.

4 WRF Parallel Performance Issues
WRF has a flattish profile, so by Amdahl's Law, speeding up any particular loop doesn't help much. The loops tend NOT to re-process data, so moving arrays between host and GPU costs more than the parallel processing would save. An overall speedup requires speeding up many loops, and also keeping the data resident on the GPU for the bulk of the computation. HYCOM and other codes have a similar issue.
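The Amdahl's-Law point above can be made concrete with a small helper (illustrative C, not part of WRF):

```c
/* Amdahl's law: overall speedup when a fraction p of the runtime is
 * accelerated by a factor s.  With WRF's flat profile no single loop
 * dominates, so accelerating any one of them barely moves the total. */
double amdahl_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}
```

For example, a loop that is 5% of the runtime sped up 10x yields only about a 1.05x overall speedup, while covering 90% of the runtime at the same 10x yields about 5.3x -- hence the need to accelerate many loops at once.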

5 WRF Re-Coding Issues
- Large code, almost 1 million lines
- Slow compiles: 30 minutes on PGI, 8 hours on Cray Fortran with modules
- Subroutines with thousands of lines, which makes manual analysis difficult
- Matching data update and present directives
- Matching data and kernel/parallel directives with compound statements

6 WRF Profile (VampirTrace), 1 Core (PSG/IvyBridge)
Routines ranked by exclusive/inclusive time; most timing values did not survive transcription, so only the names and call counts are reproduced here:
zolri2_, psim_unstable_, psih_unstable_, module_advect_em_advect_scalar_ (920 calls), module_mp_morr_two_moment_morr_two_moment_micro_ (40), module_advect_em_advect_scalar_pd_ (400), module_mp_morr_two_moment_mp_morr_two_moment_ (40), module_small_step_em_advance_w_ (280), module_ra_rrtm_rtrn_, zolri_, module_big_step_utilities_em_horizontal_pressure_gradient_gpu_ (120), psih_stable_, module_small_step_em_advance_uv_ (280), psim_stable_, module_bl_ysu_ysu2d_, module_em_rk_update_scalar_ (1200), module_big_step_utilities_em_zero_tend_gpu_ (5120), module_sf_sfclayrev_sfclayrev1d_, module_small_step_em_advance_mu_t_ (280), module_big_step_utilities_em_calc_p_rho_phi_gpu_ (120), module_pbl_driver_pbl_driver_ (40), module_big_step_utilities_em_horizontal_diffusion_ (520), module_em_rk_update_scalar_pd_ (400), solve_em_ (40), module_em_rk_scalar_tend_ (1200), module_em_rk_tendency_ (120), module_first_rk_step_part2_first_rk_step_part2_ (40).
Total time ~3600 s under VampirTrace.

7 OpenACC Speedups & Slowdowns: 1 Core (PSG/IvyBridge) versus 1 Core + 1 K40
The same routine ranking with GPU timings added alongside; most values did not survive transcription. GPU-side exclusive times where preserved:
module_advect_em_advect_scalar_pd_ (400 calls): 3.742 s; module_small_step_em_advance_w_ (280): 7.896 s; module_big_step_utilities_em_horizontal_pressure_gradient_gpu_ (120): 5.213 s; module_small_step_em_advance_uv_ (280): 2.003 s; module_em_rk_update_scalar_ (1200): 2.218 s; module_big_step_utilities_em_zero_tend_gpu_ (5120): 1.313 s; module_small_step_em_advance_mu_t_ (280): 1.282 s; module_big_step_utilities_em_calc_p_rho_phi_gpu_ (120): 0.230 s; module_big_step_utilities_em_horizontal_diffusion_ (520): 1.399 s.
Total time ~3600 s under VampirTrace.

8 Cumulative Speedups: 1 Core (PSG/IvyBridge) versus 1 Core + 1 K40
The same routine-by-routine table as the previous slide, accumulating the ported loops; the per-routine values did not survive transcription.
Total time ~3600 s under VampirTrace.

9 WRF GPU Profile (nvprof)
Top entries by time (the values did not survive transcription):
cuStreamSynchronize, rk_update_scalar_pd_1493_gpu, morr_two_moment_micro_3412_gpu, [CUDA memcpy DtoH], [CUDA memcpy HtoD], cuEventSynchronize, advance_w_1553_gpu, morr_two_moment_micro_1365_gpu, mp_morr_two_moment_799_gpu, spec_bdytend_gpu_2241_gpu, advance_w_1628_gpu, cuLaunchKernel.
Total time ~1800 s under nvprof.

10 WRF GPU Profile (nvprof)
Same profile as the previous slide, with the annotation that rk_update_scalar_pd_1493_gpu itself takes under 1 second.
Total time ~1800 s under nvprof.

11 Loop Speedups
So far, any loop we accelerate goes to near-zero time. We usually don't have to restructure the loops, just add OpenACC annotations. We will revisit this once we have a broader speedup.

12 GPU-coordination versus CPU-Cache Striding
This is the most efficient arrangement for CPU execution:

DO j = j_start, j_end
  DO k = kts, ktf
    DO i = i_start, i_end
      mrdx = msfux(i,j)*rdx  ! ADT eqn 44, 1st term on RHS
      tendency(i,k,j) = tendency(i,k,j) - mrdx*0.25 &
        *((ru(i+1,k,j)+ru(i,k,j))*(u(i+1,k,j)+u(i,k,j)) &
         -(ru(i,k,j)+ru(i-1,k,j))*(u(i,k,j)+u(i-1,k,j)))
    ENDDO
  ENDDO
ENDDO

Fortran arrays are stored in column-major order: each inner iteration scans consecutive entries in the same cache line, and each outer iteration processes a sequence of cache lines, with no repetition. For OpenMP parallelism, you will tend to split at the outer loop and let the inner loops proceed as before.
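The striding argument above can be demonstrated in miniature. Note the language flip: C is row-major where Fortran is column-major, so in this sketch (illustrative, not WRF code) the last index plays the role of Fortran's first index:

```c
#include <stddef.h>

#define NK 64
#define NI 64

/* Cache-friendly traversal: the inner loop walks consecutive memory.
 * In C the LAST index must vary fastest to get unit stride. */
double sum_unit_stride(double a[NK][NI])
{
    double s = 0.0;
    for (size_t k = 0; k < NK; k++)       /* outer: one row of cache lines */
        for (size_t i = 0; i < NI; i++)   /* inner: consecutive addresses  */
            s += a[k][i];
    return s;
}

/* Same result, but the inner loop jumps NI*sizeof(double) bytes per
 * iteration, touching a new cache line on almost every access. */
double sum_strided(double a[NK][NI])
{
    double s = 0.0;
    for (size_t i = 0; i < NI; i++)
        for (size_t k = 0; k < NK; k++)
            s += a[k][i];
    return s;
}
```

Both functions return the same sum; for arrays larger than cache, the unit-stride version is typically several times faster on a CPU.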

13 GPU-coordination versus CPU-Cache Striding
[Diagram: k-stride access through the j-th slab of the (i,k,j) matrix block. Threads in a warp are assigned adjacent (i,j) positions (i-adjacency) and march together along k, so adjacent threads process adjacent elements of the k-th column.]

14 GPU-coordination versus CPU-Cache Striding
This is the most efficient arrangement for GPU execution:

!$acc parallel
!$acc loop collapse(2)
DO j = j_start, j_end
  DO i = i_start, i_end
    DO k = kts, ktf
      mrdx = msfux(i,j)*rdx  ! ADT eqn 44, 1st term on RHS
      tendency(i,k,j) = tendency(i,k,j) - mrdx*0.25 &
        *((ru(i+1,k,j)+ru(i,k,j))*(u(i+1,k,j)+u(i,k,j)) &
         -(ru(i,k,j)+ru(i-1,k,j))*(u(i,k,j)+u(i-1,k,j)))
    ENDDO
  ENDDO
ENDDO
!$acc end parallel

Iterations of the outer loops are mapped to GPU threads: adjacent threads in a warp process consecutive elements in the same column, and the sequential inner loop processes consecutive columns with coordinated threads.

15 Loop/Conditional Interchange
Before: the boundary test on j selects whole (k,i) sweeps, so the loops cannot be collapsed for the GPU.

j_loop_y_flux_6 : DO j = j_start, j_end
  IF ( (j >= j_start_f) .and. (j <= j_end_f) ) THEN  ! use full stencil
    DO k = kts, ktf
      DO i = i_start, i_end
        vel = rv(i,k,j)
        fqy(i,k,jp1) = vel*flux6( &
          field(i,k,j-3), field(i,k,j-2), field(i,k,j-1), &
          field(i,k,j),   field(i,k,j+1), field(i,k,j+2), vel )
      ENDDO
    ENDDO
  ELSE IF ( j == jds+1 ) THEN  ! 2nd-order flux next to south boundary
    ...
  ENDIF
ENDDO j_loop_y_flux_6

After: the j loop (with its conditional) is moved innermost, with scalar carries, so k and i become the outer, collapsible loops.

j_loop_y_flux_6 : DO k = kts, ktf
  DO i = i_start, i_end
    fqy_1 = ...
    fqy_2 = ...
    DO j = j_start, j_end
      IF ( (j >= j_start_f) .and. (j <= j_end_f) ) THEN  ! use full stencil
        vel = rv(i,k,j)
        fqy_1 = vel*flux6( &
          field(i,k,j-3), field(i,k,j-2), field(i,k,j-1), &
          field(i,k,j),   field(i,k,j+1), field(i,k,j+2), vel )
      ELSE IF ( j == jds+1 ) THEN  ! 2nd-order flux next to south boundary
        ...
      ENDIF
    ENDDO
  ENDDO
ENDDO j_loop_y_flux_6
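The interchange can be sketched in miniature (plain C; the constant flux values are stand-ins for WRF's full-stencil and low-order flux computations, not the real operators):

```c
#define NJ 16
#define NKB 8
#define NIB 8

/* Original shape: the branch on j selects whole (k,i) sweeps, so the
 * outer j loop cannot be collapsed with k and i for the GPU. */
void flux_branch_outside(double out[NJ][NKB][NIB], int j_lo, int j_hi)
{
    for (int j = 0; j < NJ; j++) {
        if (j >= j_lo && j <= j_hi) {          /* interior: full stencil */
            for (int k = 0; k < NKB; k++)
                for (int i = 0; i < NIB; i++)
                    out[j][k][i] = 6.0;        /* stand-in for flux6()   */
        } else {                               /* boundary: low order    */
            for (int k = 0; k < NKB; k++)
                for (int i = 0; i < NIB; i++)
                    out[j][k][i] = 2.0;        /* stand-in for flux2()   */
        }
    }
}

/* Interchanged shape: k and i are now the outer loops (collapsible onto
 * GPU threads); the j-dependent branch runs per element inside. */
void flux_branch_inside(double out[NJ][NKB][NIB], int j_lo, int j_hi)
{
    for (int k = 0; k < NKB; k++)
        for (int i = 0; i < NIB; i++)
            for (int j = 0; j < NJ; j++)
                out[j][k][i] = (j >= j_lo && j <= j_hi) ? 6.0 : 2.0;
}
```

Both produce identical results; the second form exposes the k/i loops for collapsing onto GPU threads, at the cost of evaluating the j test per element.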

16 Horror Loop (dyn_em/module_advect_em.f)

DO j = j_start, j_end
  DO k = kts, ktf
#ifdef XEON_SIMD
!DIR$ vector always
#endif
    DO i = i_start, i_end
      ph_low = (mub(i,j)+mu_old(i,j))*field_old(i,k,j) &
               - dt*( msftx(i,j)*msfty(i,j)*( &
                        rdx*(fqxl(i+1,k,j)-fqxl(i,k,j)) + &
                        rdy*(fqyl(i,k,j+1)-fqyl(i,k,j)) ) &
                      - msfty(i,j)*rdzw(k)*(fqzl(i,k+1,j)-fqzl(i,k,j)) )

      flux_out = dt*( (msftx(i,j)*msfty(i,j))*( &
                        rdx*( max(0.,fqx(i+1,k,j)) &
                             -min(0.,fqx(i,k,j)) ) + &
                        rdy*( max(0.,fqy(i,k,j+1)) &
                             -min(0.,fqy(i,k,j)) ) ) &
                      - msfty(i,j)*rdzw(k)*( min(0.,fqz(i,k+1,j)) &
                                            -max(0.,fqz(i,k,j)) ) )

      IF ( flux_out .gt. ph_low ) THEN
        scale = max(0.,ph_low/(flux_out+eps))
        IF ( fqx(i+1,k,j) .gt. 0. ) fqx(i+1,k,j) = scale*fqx(i+1,k,j)
        IF ( fqx(i,k,j)   .lt. 0. ) fqx(i,k,j)   = scale*fqx(i,k,j)
        IF ( fqy(i,k,j+1) .gt. 0. ) fqy(i,k,j+1) = scale*fqy(i,k,j+1)
        IF ( fqy(i,k,j)   .lt. 0. ) fqy(i,k,j)   = scale*fqy(i,k,j)
        ! note: z flux is opposite sign in mass coordinate because
        ! vertical coordinate decreases with increasing k
        IF ( fqz(i,k+1,j) .lt. 0. ) fqz(i,k+1,j) = scale*fqz(i,k+1,j)
        IF ( fqz(i,k,j)   .gt. 0. ) fqz(i,k,j)   = scale*fqz(i,k,j)
      END IF
    ENDDO
  ENDDO
ENDDO

17 Horror Loop (dyn_em/module_advect_em.f)

!$acc kernels
!$acc loop independent collapse(3) private(ph_low,flux_out)
DO j = j_start, j_end
  DO k = kts, ktf
#ifdef XEON_SIMD
!DIR$ vector always
#endif
    DO i = i_start, i_end
      ph_low = (mub(i,j)+mu_old(i,j))*field_old(i,k,j) &
               - dt*( msftx(i,j)*msfty(i,j)*( &
                        rdx*(fqxl(i+1,k,j)-fqxl(i,k,j)) + &
                        rdy*(fqyl(i,k,j+1)-fqyl(i,k,j)) ) &
                      - msfty(i,j)*rdzw(k)*(fqzl(i,k+1,j)-fqzl(i,k,j)) )

      flux_out = dt*( (msftx(i,j)*msfty(i,j))*( &
                        rdx*( max(0.,fqx(i+1,k,j)) &
                             -min(0.,fqx(i,k,j)) ) + &
                        rdy*( max(0.,fqy(i,k,j+1)) &
                             -min(0.,fqy(i,k,j)) ) ) &
                      - msfty(i,j)*rdzw(k)*( min(0.,fqz(i,k+1,j)) &
                                            -max(0.,fqz(i,k,j)) ) )

      IF ( flux_out .gt. ph_low ) THEN
        scale = max(0.,ph_low/(flux_out+eps))
        IF ( fqx(i+1,k,j) .gt. 0. ) fqx(i+1,k,j) = scale*fqx(i+1,k,j)
        IF ( fqx(i,k,j)   .lt. 0. ) fqx(i,k,j)   = scale*fqx(i,k,j)
        IF ( fqy(i,k,j+1) .gt. 0. ) fqy(i,k,j+1) = scale*fqy(i,k,j+1)
        IF ( fqy(i,k,j)   .lt. 0. ) fqy(i,k,j)   = scale*fqy(i,k,j)
        ! note: z flux is opposite sign in mass coordinate because
        ! vertical coordinate decreases with increasing k
        IF ( fqz(i,k+1,j) .lt. 0. ) fqz(i,k+1,j) = scale*fqz(i,k+1,j)
        IF ( fqz(i,k,j)   .gt. 0. ) fqz(i,k,j)   = scale*fqz(i,k,j)
      END IF
    ENDDO
  ENDDO
ENDDO
!$acc end kernels

18 Mirroring the Record-Structure

TYPE( domain ), POINTER :: grid
! grid%msfux is declared as
!   real, dimension(grid%sm31:grid%em31, grid%sm33:grid%em33) :: msfux
! and allocated with
!   ALLOCATE( grid%msfux(sm31:em31, sm33:em33), STAT=ierr )

real, dimension(:,:), pointer :: msfux_pt
CALL alloc_gpu_extra2d( grid%msfux, msfux_pt )

! The 1D variant, shown as representative:
SUBROUTINE alloc_gpu_extra1d( grid_pt, pt )
  real, dimension(:), pointer :: grid_pt
  real, dimension(:), pointer :: pt
!$acc enter data create(grid_pt)
  pt => grid_pt
END SUBROUTINE alloc_gpu_extra1d

19 OpenACC Original Intent
Well-bracketed regions: position data on the GPU, process it, save the result, and discard the rest.

!$acc data create(...) &
!$acc&     copy(...)
      call subroutine
!$acc end data

We are finding instead that we need to swap arrays between host and GPU memory at random points.

20 Partial Port GPU Data Placement
We need to reduce the amount of data movement between host and GPU. This requires porting more code, so that it can operate on GPU-resident data. Explicit swaps have to be used when moving from GPU code to host code:

!$acc update device(...)
!$acc update host(...)
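The update pattern looks like this in miniature (C sketch; the routine and array names are made up, but the directives are standard OpenACC, and under a non-OpenACC compiler the pragmas are ignored and the code runs sequentially on the host):

```c
#define NPTS 1024

/* Hypothetical partially-ported pipeline: the parallel loop runs on the
 * device, the boundary fix-up has not been ported yet.  Without the
 * updates, the host would read stale data and the device would miss the
 * host's edits. */
void run_partial_port(double field[NPTS])
{
    #pragma acc data copy(field[0:NPTS])
    {
        for (int step = 0; step < 10; step++) {
            #pragma acc parallel loop present(field)
            for (int i = 0; i < NPTS; i++)
                field[i] += 1.0;                     /* device-side work   */

            #pragma acc update host(field[0:NPTS])   /* device -> host     */
            field[0] = field[NPTS - 1];              /* unported host code */
            #pragma acc update device(field[0:NPTS]) /* host -> device     */
        }
    }
}
```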

21 GPU Data Placement (WRF)
wrf_run -> integrate -> solve_interface -> solve_em (DATA PLACEMENT HERE)
   -> microphysics_driver -> mp_gt_driver
   -> first_rk_step_part1
   -> first_rk_step_part2
   -> rk_scalar_tend -> advect_scalar_pd

22 Mixing OpenACC & CPU Code
In some cases it may be clearer to interleave the two in the same file:

#ifdef _OPENACC
! GPU-optimized loops
#else
! CPU-optimized loops
#endif

In other cases it may be clearer to duplicate the file and use the build process to select between them:

module_advect_em.f
module_advect_em.openacc.f

You can view the changes side-by-side with sdiff.
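The interleaved variant can be sketched as follows (hypothetical C routine, not WRF code; `_OPENACC` is the macro that OpenACC-enabled compilers predefine, so other compilers fall through to the CPU branch):

```c
#define NCOLS 256

/* One source file, two loop structures.  An OpenACC compiler defines
 * _OPENACC and gets the GPU-shaped loops; any other compiler gets the
 * cache-friendly CPU loops.  Both branches compute the same sums. */
void column_sums(double a[NCOLS][NCOLS], double out[NCOLS])
{
#ifdef _OPENACC
    /* GPU-shaped: one thread per output element, sequential inner loop */
    #pragma acc parallel loop
    for (int i = 0; i < NCOLS; i++) {
        double s = 0.0;
        for (int k = 0; k < NCOLS; k++)
            s += a[k][i];
        out[i] = s;
    }
#else
    /* CPU-shaped: unit-stride inner loop, accumulate across rows */
    for (int i = 0; i < NCOLS; i++)
        out[i] = 0.0;
    for (int k = 0; k < NCOLS; k++)
        for (int i = 0; i < NCOLS; i++)
            out[i] += a[k][i];
#endif
}
```

Keeping both variants in one routine makes them easy to diff, at the cost of some clutter; the duplicate-file approach keeps each variant clean but risks the two copies drifting apart.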


More information

COMP528: Multi-core and Multi-Processor Computing

COMP528: Multi-core and Multi-Processor Computing COMP528: Multi-core and Multi-Processor Computing Dr Michael K Bane, G14, Computer Science, University of Liverpool m.k.bane@liverpool.ac.uk https://cgi.csc.liv.ac.uk/~mkbane/comp528 2X So far Why and

More information

GPU Debugging Made Easy. David Lecomber CTO, Allinea Software

GPU Debugging Made Easy. David Lecomber CTO, Allinea Software GPU Debugging Made Easy David Lecomber CTO, Allinea Software david@allinea.com Allinea Software HPC development tools company Leading in HPC software tools market Wide customer base Blue-chip engineering,

More information

Parallel Poisson Solver in Fortran

Parallel Poisson Solver in Fortran Parallel Poisson Solver in Fortran Nilas Mandrup Hansen, Ask Hjorth Larsen January 19, 1 1 Introduction In this assignment the D Poisson problem (Eq.1) is to be solved in either C/C++ or FORTRAN, first

More information

Deutscher Wetterdienst

Deutscher Wetterdienst Accelerating Work at DWD Ulrich Schättler Deutscher Wetterdienst Roadmap Porting operational models: revisited Preparations for enabling practical work at DWD My first steps with the COSMO on a GPU First

More information

Is OpenMP 4.5 Target Off-load Ready for Real Life? A Case Study of Three Benchmark Kernels

Is OpenMP 4.5 Target Off-load Ready for Real Life? A Case Study of Three Benchmark Kernels National Aeronautics and Space Administration Is OpenMP 4.5 Target Off-load Ready for Real Life? A Case Study of Three Benchmark Kernels Jose M. Monsalve Diaz (UDEL), Gabriele Jost (NASA), Sunita Chandrasekaran

More information

GPUs and Emerging Architectures

GPUs and Emerging Architectures GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs

More information

Introduction to OpenACC. Shaohao Chen Research Computing Services Information Services and Technology Boston University

Introduction to OpenACC. Shaohao Chen Research Computing Services Information Services and Technology Boston University Introduction to OpenACC Shaohao Chen Research Computing Services Information Services and Technology Boston University Outline Introduction to GPU and OpenACC Basic syntax and the first OpenACC program:

More information

2/2/11. Administrative. L6: Memory Hierarchy Optimization IV, Bandwidth Optimization. Project Proposal (due 3/9) Faculty Project Suggestions

2/2/11. Administrative. L6: Memory Hierarchy Optimization IV, Bandwidth Optimization. Project Proposal (due 3/9) Faculty Project Suggestions Administrative L6: Memory Hierarchy Optimization IV, Bandwidth Optimization Next assignment available Goals of assignment: simple memory hierarchy management block-thread decomposition tradeoff Due Tuesday,

More information

GPU Sparse Graph Traversal

GPU Sparse Graph Traversal GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex

More information

Introduction to OpenACC

Introduction to OpenACC Introduction to OpenACC Alexander B. Pacheco User Services Consultant LSU HPC & LONI sys-help@loni.org HPC Training Spring 2014 Louisiana State University Baton Rouge March 26, 2014 Introduction to OpenACC

More information

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU Lifan Xu Wei Wang Marco A. Alvarez John Cavazos Dongping Zhang Department of Computer and Information Science University of Delaware

More information

COSMO Dynamical Core Redesign Tobias Gysi David Müller Boulder,

COSMO Dynamical Core Redesign Tobias Gysi David Müller Boulder, COSMO Dynamical Core Redesign Tobias Gysi David Müller Boulder, 8.9.2011 Supercomputing Systems AG Technopark 1 8005 Zürich 1 Fon +41 43 456 16 00 Fax +41 43 456 16 10 www.scs.ch Boulder, 8.9.2011, by

More information

GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler

GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler Taylor Lloyd Jose Nelson Amaral Ettore Tiotto University of Alberta University of Alberta IBM Canada 1 Why? 2 Supercomputer Power/Performance GPUs

More information

OpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016

OpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016 OpenACC. Part I Ned Nedialkov McMaster University Canada October 2016 Outline Introduction Execution model Memory model Compiling pgaccelinfo Example Speedups Profiling c 2016 Ned Nedialkov 2/23 Why accelerators

More information

OpenACC Course Lecture 1: Introduction to OpenACC September 2015

OpenACC Course Lecture 1: Introduction to OpenACC September 2015 OpenACC Course Lecture 1: Introduction to OpenACC September 2015 Course Objective: Enable you to accelerate your applications with OpenACC. 2 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15:

More information

Physical parametrizations and OpenACC directives in COSMO

Physical parametrizations and OpenACC directives in COSMO Physical parametrizations and OpenACC directives in COSMO Xavier Lapillonne Eidgenössisches Departement des Innern EDI Bundesamt für Meteorologie und Klimatologie MeteoSchweiz Name (change on Master slide)

More information

Progress on GPU Parallelization of the NIM Prototype Numerical Weather Prediction Dynamical Core

Progress on GPU Parallelization of the NIM Prototype Numerical Weather Prediction Dynamical Core Progress on GPU Parallelization of the NIM Prototype Numerical Weather Prediction Dynamical Core Tom Henderson NOAA/OAR/ESRL/GSD/ACE Thomas.B.Henderson@noaa.gov Mark Govett, Jacques Middlecoff Paul Madden,

More information

The challenges of new, efficient computer architectures, and how they can be met with a scalable software development strategy.! Thomas C.

The challenges of new, efficient computer architectures, and how they can be met with a scalable software development strategy.! Thomas C. The challenges of new, efficient computer architectures, and how they can be met with a scalable software development strategy! Thomas C. Schulthess ENES HPC Workshop, Hamburg, March 17, 2014 T. Schulthess!1

More information

GPU Computing with OpenACC Directives

GPU Computing with OpenACC Directives GPU Computing with OpenACC Directives Alexey Romanenko Based on Jeff Larkin s PPTs 3 Ways to Accelerate Applications Applications Libraries OpenACC Directives Programming Languages Drop-in Acceleration

More information

AMath 483/583, Lecture 24, May 20, Notes: Notes: What s a GPU? Notes: Some GPU application areas

AMath 483/583, Lecture 24, May 20, Notes: Notes: What s a GPU? Notes: Some GPU application areas AMath 483/583 Lecture 24 May 20, 2011 Today: The Graphical Processing Unit (GPU) GPU Programming Today s lecture developed and presented by Grady Lemoine References: Andreas Kloeckner s High Performance

More information

Trellis: Portability Across Architectures with a High-level Framework

Trellis: Portability Across Architectures with a High-level Framework : Portability Across Architectures with a High-level Framework Lukasz G. Szafaryn + Todd Gamblin ++ Bronis R. de Supinski ++ Kevin Skadron + + University of Virginia {lgs9a, skadron@virginia.edu ++ Lawrence

More information

Compiling a High-level Directive-Based Programming Model for GPGPUs

Compiling a High-level Directive-Based Programming Model for GPGPUs Compiling a High-level Directive-Based Programming Model for GPGPUs Xiaonan Tian, Rengan Xu, Yonghong Yan, Zhifeng Yun, Sunita Chandrasekaran, and Barbara Chapman Department of Computer Science, University

More information

Portable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.

Portable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. Portable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 What is Cray Libsci_acc? Provide basic scientific

More information

Heidi Poxon Cray Inc.

Heidi Poxon Cray Inc. Heidi Poxon Topics GPU support in the Cray performance tools CUDA proxy MPI support for GPUs (GPU-to-GPU) 2 3 Programming Models Supported for the GPU Goal is to provide whole program analysis for programs

More information

A Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA

A Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA A Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA L. Oteski, G. Colin de Verdière, S. Contassot-Vivier, S. Vialle, J. Ryan Acks.: CEA/DIFF, IDRIS, GENCI, NVIDIA, Région

More information

Parallelization of the NIM Dynamical Core for GPUs

Parallelization of the NIM Dynamical Core for GPUs Xbox 360 CPU Parallelization of the NIM Dynamical Core for GPUs Mark Govett Jacques Middlecoff, Tom Henderson, JimRosinski, Craig Tierney Bigger Systems More oeexpensive e Facilities Bigger Power Bills

More information

Lecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators

Lecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators Lecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators CSCE 569 Parallel Computing Department of Computer Science and Engineering Yonghong Yan yanyh@cse.sc.edu

More information

Portable and Productive Performance with OpenACC Compilers and Tools. Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.

Portable and Productive Performance with OpenACC Compilers and Tools. Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. Portable and Productive Performance with OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 Cray: Leadership in Computational Research Earth Sciences

More information

CLAW FORTRAN Compiler source-to-source translation for performance portability

CLAW FORTRAN Compiler source-to-source translation for performance portability CLAW FORTRAN Compiler source-to-source translation for performance portability XcalableMP Workshop, Akihabara, Tokyo, Japan October 31, 2017 Valentin Clement valentin.clement@env.ethz.ch Image: NASA Summary

More information

CONTINUED EFFORTS IN ADAPTING THE GEOS- 5 AGCM TO ACCELERATORS: SUCCESSES AND CHALLENGES

CONTINUED EFFORTS IN ADAPTING THE GEOS- 5 AGCM TO ACCELERATORS: SUCCESSES AND CHALLENGES CONTINUED EFFORTS IN ADAPTING THE GEOS- 5 AGCM TO ACCELERATORS: SUCCESSES AND CHALLENGES 9/20/2013 Matt Thompson matthew.thompson@nasa.gov Accelerator Conversion Aims Code rewrites will probably be necessary,

More information

Chapter 3 Parallel Software

Chapter 3 Parallel Software Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

OPENACC: ACCELERATE KIRCHHOFF 2D MIGRATION

OPENACC: ACCELERATE KIRCHHOFF 2D MIGRATION April 4-7, 2016 Silicon Valley OPENACC: ACCELERATE KIRCHHOFF 2D MIGRATION Ken Hester: NVIDIA Solution Architect Oil &Gas EXPLORATION & PRODUCTION WORKFLOW Acquire Seismic Data Process Seismic Data Interpret

More information

Parallelism III. MPI, Vectorization, OpenACC, OpenCL. John Cavazos,Tristan Vanderbruggen, and Will Killian

Parallelism III. MPI, Vectorization, OpenACC, OpenCL. John Cavazos,Tristan Vanderbruggen, and Will Killian Parallelism III MPI, Vectorization, OpenACC, OpenCL John Cavazos,Tristan Vanderbruggen, and Will Killian Dept of Computer & Information Sciences University of Delaware 1 Lecture Overview Introduction MPI

More information

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer OpenACC Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer 3 WAYS TO ACCELERATE APPLICATIONS Applications Libraries Compiler Directives Programming Languages Easy to use Most Performance

More information

Objective. GPU Teaching Kit. OpenACC. To understand the OpenACC programming model. Introduction to OpenACC

Objective. GPU Teaching Kit. OpenACC. To understand the OpenACC programming model. Introduction to OpenACC GPU Teaching Kit Accelerated Computing OpenACC Introduction to OpenACC Objective To understand the OpenACC programming model basic concepts and pragma types simple examples 2 2 OpenACC The OpenACC Application

More information

Performance Portability and OpenACC

Performance Portability and OpenACC Performance Portability and OpenACC Douglas Miles, David Norton and Michael Wolfe PGI / NVIDIA ABSTRACT: Performance portability means a single program gives good performance across a variety of systems,

More information

Experiences with Achieving Portability across Heterogeneous Architectures

Experiences with Achieving Portability across Heterogeneous Architectures Experiences with Achieving Portability across Heterogeneous Architectures Lukasz G. Szafaryn +, Todd Gamblin ++, Bronis R. de Supinski ++ and Kevin Skadron + + University of Virginia ++ Lawrence Livermore

More information

Programming NVIDIA GPUs with OpenACC Directives

Programming NVIDIA GPUs with OpenACC Directives Programming NVIDIA GPUs with OpenACC Directives Michael Wolfe michael.wolfe@pgroup.com http://www.pgroup.com/accelerate Programming NVIDIA GPUs with OpenACC Directives Michael Wolfe mwolfe@nvidia.com http://www.pgroup.com/accelerate

More information

KernelGen a toolchain for automatic GPU-centric applications porting. Nicolas Lihogrud Dmitry Mikushin Andrew Adinets

KernelGen a toolchain for automatic GPU-centric applications porting. Nicolas Lihogrud Dmitry Mikushin Andrew Adinets P A R A L L E L C O M P U T A T I O N A L T E C H N O L O G I E S ' 2 0 1 2 KernelGen a toolchain for automatic GPU-centric applications porting Nicolas Lihogrud Dmitry Mikushin Andrew Adinets Contents

More information

CPU GPU. Regional Models. Global Models. Bigger Systems More Expensive Facili:es Bigger Power Bills Lower System Reliability

CPU GPU. Regional Models. Global Models. Bigger Systems More Expensive Facili:es Bigger Power Bills Lower System Reliability Xbox 360 Successes and Challenges using GPUs for Weather and Climate Models DOE Jaguar Mark GoveM Jacques Middlecoff, Tom Henderson, Jim Rosinski, Craig Tierney CPU Bigger Systems More Expensive Facili:es

More information

Profiling & Tuning Applications. CUDA Course István Reguly

Profiling & Tuning Applications. CUDA Course István Reguly Profiling & Tuning Applications CUDA Course István Reguly Introduction Why is my application running slow? Work it out on paper Instrument code Profile it NVIDIA Visual Profiler Works with CUDA, needs

More information