Optimization and porting of a numerical code for simulations in GRMHD on CPU/GPU clusters PRACE Winter School Stage


1 Optimization and porting of a numerical code for simulations in GRMHD on CPU/GPU clusters
PRACE Winter School Stage
INFN - Università di Parma
November 6, 2012

2 Table of contents
1 Introduction
2 The ECHO code
3 OpenACC
4 Porting ECHO to GPUs

3 Let me introduce myself
- I am a 2nd-year PhD student in Physics at Parma University
- Master thesis in Florence with L. Del Zanna: oscillations of rotating neutron stars (ECHO code)
- PhD in Parma with R. De Pietri: dynamical instabilities of rotating NS with magnetic fields in GR (Cactus/Whisky)
- PRACE Winter School 2012 here in CINECA
- I would like to have a fast working code to perform 2D simulations of NS on a cluster with GPUs
- The starting point is the 2- and 3-D code ECHO
- The ECHO code would need to be optimized, fully parallelized and ported to GPUs

4 General Relativistic Magnetohydrodynamics (GRMHD)
- GRMHD is the study of magnetized fluid flows in general relativistic spacetimes
- It is required for the study of extremely compact astrophysical objects, like neutron stars and black holes
- The spacetime is a solution of Einstein's field equations of general relativity, and can either be solved for dynamically or assumed given by the initial data (Cowling approximation)
- Solution techniques are similar to traditional Computational Fluid Dynamics, with some important differences

5 ECHO: Overview
- Eulerian Conservative High-Order code: the aim is to combine shock-capturing properties and accuracy for small-scale wave propagation and turbulence, in a 3+1 approach (L. Del Zanna, O. Zanotti, N. Bucciantini, P. Londrillo, 2007, A&A 473, 11)
- GR upgrade of: Londrillo & Del Zanna 2000; Del Zanna et al. 2002, 2003
- Modular structure, F90 language, MPI parallelization
- Any metric allowed (1-, 2- or 3-D), even time-dependent
- Finite-difference scheme, Runge-Kutta time-stepping
- UCT strategy for the magnetic field (staggered grid)
- Central-type Riemann solvers (LLF, HLL, HLLC)
- Upgrades: resistivity, radiation HD, evolving spacetimes

6 ECHO: numerical scheme
- The two sets of conservation laws are discretized in space according to the Upwind Constrained Transport strategy (UCT: Londrillo & Del Zanna ApJ 530, 508, 2000; JCP 195, 17, 2004)
- Staggered grid for magnetic and electric field components
- Finite differences: point values at cell centers (u), at cell faces (b and f), at edges (e)
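For reference, the equations ECHO evolves can be written schematically (following Del Zanna et al. 2007; indices and metric factors are omitted here) as a set of balance laws plus the induction equation:

$$\partial_t\, \mathcal{U} + \partial_i\, \mathcal{F}^i = \mathcal{S}, \qquad \partial_t B^i + \epsilon^{ijk}\partial_j E_k = 0, \qquad \partial_i B^i = 0,$$

where $\mathcal{U}$ are the conservative variables, $\mathcal{F}^i$ the fluxes, and $\mathcal{S}$ the (geometrical) source terms; the UCT staggering is designed to preserve the last, divergence-free constraint.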

7 ECHO: Upwind Constrained Transport
- The divergence-free constraint is satisfied algebraically at 2nd spatial order
- Single upwind state for B^i along direction i
- Four-state numerical fluxes for the magnetic field at edges

8 ECHO: the evolution scheme (1/3)
The ECHO evolution scheme needs to:
- Compute the primitive fluid variables from the conservative ones (see next) and interpolate the B components at cell center; 8 primitive variables at cell center: $P = [\rho, v, p, B]^T$
- For every direction, fill the boundary ghost zones with the values of the primitives and reconstruct left (L) and right (R) upwind states at interfaces ($B^i$ along $i$ is unchanged): $P^{L,R}_{i+1/2,\,j} = \mathcal{R}^{L,R}(\{P_{ij}\})$
- Compute metric properties at interfaces by interpolating the metric defined on the grid (only at t = 0 in the Cowling approximation)

9 ECHO: the evolution scheme (2/3)
- Compute the upwind fluxes for the fluid part from the primitive variables using an approximate Riemann solver (save the fast-magnetosonic and transverse transport speeds)
- For every direction, compute the spatial derivatives of the fluxes at cell center and the maximum wave speed
- Compute geometrical source terms (right-hand side of the conservation laws)
- Reconstruct B and v from cell faces to cell edges
- Compute numerical fluxes for the magnetic field (upwind electric fields) and their derivatives

10 ECHO: the evolution scheme (3/3)
- Compute the maximum time step obeying the CFL (Courant-Friedrichs-Lewy) condition, with $0 < c < 1$:
$$\Delta t = \frac{c}{\max_i \left( a^i_M / h_i \right)}$$
(the $a^i_M$ are the maximum speeds over the whole domain, for each direction $i$)
- Update all the variables using the fluxes and electric field derivatives with a 2nd-order Runge-Kutta scheme:
$$u^{(1)} = u^n + \Delta t\, R(u^n), \qquad u^{n+1} = \tfrac{1}{2}\, u^n + \tfrac{1}{2} \left[ u^{(1)} + \Delta t\, R(u^{(1)}) \right]$$
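In code, a minimal sketch of this RK2 update (hypothetical array and routine names, not ECHO's actual ones) could read:

  ! A minimal RK2 step (sketch): u, u1, rhs are state arrays of the
  ! same shape; rhs_eval is a hypothetical routine filling rhs with R(u);
  ! dt obeys the CFL condition.
  subroutine rk2_step(n, u, u1, rhs, dt)
    implicit none
    integer, intent(in)    :: n
    real(8), intent(inout) :: u(n)
    real(8), intent(out)   :: u1(n), rhs(n)
    real(8), intent(in)    :: dt
    external :: rhs_eval
    call rhs_eval(n, u, rhs)
    u1 = u + dt*rhs                      ! predictor: u(1) = u^n + dt R(u^n)
    call rhs_eval(n, u1, rhs)
    u  = 0.5d0*u + 0.5d0*(u1 + dt*rhs)   ! corrector: u^{n+1}
  end subroutine rk2_step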

11 ECHO: The time evolution routine

12 Optimizations
The main optimizations on the original code are the following:
- implicit none was added to all subroutines, so that all the variables are now explicitly declared
- all the variables read and written by a subroutine are now passed as arguments and declared with their own intent
- many do loops were conveniently merged or rearranged
- some routines have been split into different routines for different cases
- many useless temporary array variables were eliminated
- all the physical parameters, the evolution parameters and the grid and boundary settings are now contained in one file instead of being scattered among many source files
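To illustrate the first two points, a sketch of the adopted style (a hypothetical routine, not taken from ECHO):

  subroutine scale_fluxes(nv, nx, a, p, f)
    implicit none                      ! every variable declared explicitly
    integer, intent(in)  :: nv, nx     ! array extents
    real(8), intent(in)  :: a          ! scaling factor
    real(8), intent(in)  :: p(nv, nx)  ! read-only input
    real(8), intent(out) :: f(nv, nx)  ! written here only
    f = a * p                          ! placeholder computation
  end subroutine scale_fluxes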

13 CPUs vs GPUs
CPU design:
- multi-core
- sophisticated control logic unit
- large cache memories to reduce access latencies
GPU design:
- many cores (several hundred)
- minimized control logic, in order to manage lightweight threads and maximize execution throughput
- large number of threads to overcome long-latency memory accesses

14 OpenACC: What is it?
OpenACC:
- is a programming API standard to program accelerators
- is portable across operating systems and various types of host CPUs and accelerators
- allows programmers to provide directives to the compiler identifying which areas of code to accelerate, without requiring programmers to modify or adapt the code itself
- is aimed at incremental development of accelerator code

15 OpenACC: About the standard
- announced at the SC11 conference (Seattle, November 2011)
- offers portability between compilers
- drawn up by: NVIDIA, Cray, PGI, CAPS
- works for Fortran, C, C++
- standard available at www.openacc.org
- current version: 1.0a (April 2012); work is now targeting additional features for v1.1
- compiler support: all expected to be complete in 2012 (PGI: full support from version 12.6, late August)

16 OpenACC: Execution model
- Host-directed execution with an attached GPU accelerator
- The main program executes on the host (i.e. the CPU)
- Compute-intensive regions are offloaded to the accelerator device under control of the host
- The device (i.e. the GPU) executes parallel regions, which typically contain kernels (i.e. work-sharing loops), or kernels regions, containing loops executed as kernels
- The host must orchestrate the execution by: allocating memory on the accelerator device, initiating data transfer, sending the code to the accelerator, passing arguments to the parallel region, queuing the device code, waiting for completion, transferring results back to the host, and deallocating memory
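A minimal sketch of this host-directed model in Fortran (an illustrative program, not from the ECHO code):

  program offload_demo
    implicit none
    integer, parameter :: n = 1000000
    integer :: i
    real(8) :: x(n)
    ! The host orchestrates: the compiler allocates x on the device,
    ! launches the loop as a kernel, then copies the result back.
    !$acc kernels copyout(x)
    do i = 1, n
       x(i) = sqrt(dble(i))
    end do
    !$acc end kernels
    print *, x(1), x(n)   ! executed back on the host
  end program offload_demo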

17 OpenACC: Memory model
- Memory spaces on the host and device are distinct: different locations, different address spaces
- Data movement is performed by the host using runtime library calls that explicitly move data
- GPUs have a weak memory model: no synchronisation between different execution units (SMs), unless there is an explicit memory barrier
- One can write OpenACC kernels with race conditions, giving inconsistent execution results; the compiler will catch most such errors, but not all
- OpenACC data movement between the memories is implicit, managed by the compiler based on directives from the programmer
- Device memory caches are managed by the compiler, with hints from the programmer in the form of directives

18 OpenACC: Benefits
- Modify the original source code with directives: non-executable statements (comments, pragmas) that can be ignored by a non-accelerating compiler
- No need for a separate source base: much more portable
- Can preserve the subprogram structure
- Familiar programming model if you have used traditional OpenMP
- A small performance gap is acceptable (the target is 10-15%; currently seeing better than that for many cases)
- An open standard is the most attractive option for developers: portability, multiple compilers for debugging

19 OpenACC: Levels of parallelism
The model target architecture is a collection of processing elements (PEs), where each PE is multithreaded, and each thread on the PE can execute vector instructions. The OpenACC execution model has three levels of parallelism:
- the gang dimension maps across the PEs (CUDA blocks)
- the worker dimension maps across the multithreading dimension within a PE (warps)
- the vector dimension maps to the vector instructions (threads within a warp)
There is no support for any synchronization between gangs, since current accelerators typically do not support synchronization across PEs.

20 OpenACC: Categories of OpenACC APIs
- Accelerator Parallel Region / Kernels Directives
- Loop Directives
- Data Declaration Directives
- Data Region Directives
- Wait / Update Directives
- Runtime Library Routines
- Environment Variables

21 OpenACC: Accelerator directives
Fortran:
- sentinel: !$acc *, paired with !$acc end *
- continuation lines: & at the end of the line, plus !$acc& on the next line
C/C++:
- sentinel: #pragma acc *
- applies to the following structured block { ... }, which avoids the need for end directives
- continuation lines: \ at the end of the line
General form: !$acc directive-name [clause [,clause]]
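For example (an illustrative fragment, not from ECHO), a Fortran directive with a continuation line looks like:

  subroutine vadd(n, a, b, c)
    implicit none
    integer, intent(in)  :: n
    real(8), intent(in)  :: a(n), b(n)
    real(8), intent(out) :: c(n)
    integer :: i
    ! The & continues the directive; the next line restarts with !$acc&.
    !$acc parallel loop copyin(a, b) &
    !$acc&              copyout(c)
    do i = 1, n
       c(i) = a(i) + b(i)
    end do
    !$acc end parallel loop
  end subroutine vadd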

22 OpenACC: Parallel Directive 1/2
- Starts parallel execution on the accelerator
- Specified by: !$acc parallel [clause [,clause]]
When encountered:
- gangs of worker threads are created to execute on the accelerator
- one worker in each gang begins executing the code in the structured block
- the number of gangs/workers remains constant in the parallel region

23 OpenACC: Parallel Directive 2/2
The clauses for the !$acc parallel directive are:
- if (condition)
- async [(scalar-integer-expression)]
- num_gangs, num_workers, vector_length (scalar-integer-expr.)
- reduction (operator:list)
- copy, copyin, copyout (list)
- create (list)
- private (list)
- present, present_or_... (list)
If async is not present, there is an implicit barrier at the end of the accelerator parallel region.
present_or_copy is the default for aggregate types (arrays); private or copy is the default for scalar variables.
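A sketch combining a few of these clauses (an illustrative routine):

  subroutine saxpy(n, a, x, y)
    implicit none
    integer, intent(in)    :: n
    real(8), intent(in)    :: a, x(n)
    real(8), intent(inout) :: y(n)
    integer :: i
    ! Tuning clauses are optional hints; the scalar a defaults to private.
    !$acc parallel num_gangs(64) vector_length(128) copyin(x) copy(y)
    !$acc loop
    do i = 1, n
       y(i) = a*x(i) + y(i)
    end do
    !$acc end parallel
  end subroutine saxpy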

24 OpenACC: Kernels Directive
- Defines a region of a program that is to be compiled into a sequence of kernels for execution on the accelerator
- Each loop nest will become a different kernel; kernels are launched in order on the device
- Specified by: !$acc kernels [clause [,clause]]
- A kernels directive may not contain a nested parallel or kernels directive
- The configuration of gangs and worker threads may be different for each kernel
- If async is present, the kernels or parallel region will execute asynchronously on the accelerator
- present_or_copy is the default for aggregate types (arrays); private or copy is the default for scalar variables
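For instance (illustrative), the two loop nests below become two kernels, launched in order:

  subroutine init_and_square(n, a, b)
    implicit none
    integer, intent(in)  :: n
    real(8), intent(out) :: a(n), b(n)
    integer :: i
    !$acc kernels
    do i = 1, n            ! compiled into a first kernel
       a(i) = dble(i)
    end do
    do i = 1, n            ! compiled into a second kernel
       b(i) = a(i)*a(i)
    end do
    !$acc end kernels
  end subroutine init_and_square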

25 OpenACC: Loop Directive
- Used to describe what type of parallelism to use to execute the loop on the accelerator
- Can be used to declare loop-private variables and arrays, and reduction operations
- Specified by: !$acc loop [clause [,clause]] followed by a do loop
The clauses for the !$acc loop directive are:
- collapse (n)
- gang, worker, vector [(scalar-integer-expression)]
- seq
- independent
- private (list)
- reduction (operator:list)
Combined directives are specified by: !$acc parallel loop, !$acc kernels loop
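A sketch of a combined directive with collapse and reduction clauses (illustrative):

  function total(nx, ny, a) result(s)
    implicit none
    integer, intent(in) :: nx, ny
    real(8), intent(in) :: a(nx, ny)
    real(8) :: s
    integer :: i, j
    s = 0.0d0
    ! collapse(2) fuses the two loops into one parallel iteration space;
    ! reduction(+:s) gives each thread a private copy of s, combined at the end.
    !$acc parallel loop collapse(2) reduction(+:s)
    do j = 1, ny
       do i = 1, nx
          s = s + a(i, j)
       end do
    end do
  end function total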

26 OpenACC: Data Directive
- The data construct defines scalars, arrays and subarrays to be allocated in the accelerator memory for the duration of the region
- Can be used to control whether data should be copied in or out from the host
- Specified by: !$acc data [clause [,clause]]
The clauses for the !$acc data directive are:
- if (condition)
- copy, copyin, copyout (list)
- create (list)
- present, present_or_... (list)
- deviceptr (list)
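A sketch of a data region keeping an intermediate array on the device between two operations (illustrative):

  subroutine pipeline(n, a, b)
    implicit none
    integer, intent(in)  :: n
    real(8), intent(in)  :: a(n)
    real(8), intent(out) :: b(n)
    real(8) :: tmp(n)
    ! tmp is created on the device and never transferred.
    !$acc data copyin(a) create(tmp) copyout(b)
    !$acc kernels
    tmp(:) = 2.0d0*a(:)
    b(:)   = tmp(:) + 1.0d0
    !$acc end kernels
    !$acc end data
  end subroutine pipeline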

27 OpenACC: Declare Directive
- Used in the variable declaration section of a program to specify that a variable should be allocated, copied in/out in an implicit data region of a function, subroutine or program
- If specified within a Fortran module, the implicit data region is valid for the whole program
- Specified by: !$acc declare [clause [,clause]]
- Not fully implemented in PGI compiler release 12.8 yet? See later...
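For example (illustrative), declaring a module array device-resident for the whole program:

  module device_globals
    implicit none
    real(8) :: coef(1024)
    ! The implicit data region for coef spans the whole program.
    !$acc declare create(coef)
  end module device_globals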

28 OpenACC: Update and Wait Directives
OpenACC Update Directive: !$acc update [clause [,clause]]
- Used within a data region to update / synchronize the values of the arrays on both the host and the accelerator
- The clauses for the !$acc update directive are: host, device (list); if (condition); async [(scalar-integer-expression)]
OpenACC Wait Directive: !$acc wait [(scalar-integer-expression)]
- It causes the program to wait for completion of an asynchronous activity, such as an accelerator parallel or kernels region or an update directive
- It will test and evaluate the integer expression for completion
- If no argument is specified, the host process will wait until all asynchronous activities have completed
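A sketch combining async, wait and update inside a data region (illustrative):

  subroutine bump(n, u)
    implicit none
    integer, intent(in)    :: n
    real(8), intent(inout) :: u(n)
    !$acc data copy(u)
    !$acc kernels async(1)
    u(:) = u(:) + 1.0d0
    !$acc end kernels
    !$acc wait(1)              ! block until queue 1 completes
    !$acc update host(u)       ! refresh the host copy (e.g. for output)
    print *, u(1)
    !$acc end data
  end subroutine bump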

29 OpenACC: Directive Status
The PGI compiler has been providing full OpenACC support since release 12.6 (August 2012). Some of the features supposed to be implemented in 12.6 are:
- !$acc parallel reduction and !$acc loop reduction
- !$acc parallel private() and !$acc loop private()
- !$acc parallel present() and !$acc parallel present_or_...()
- !$acc parallel deviceptr()
- !$acc declare device_resident() and !$acc declare deviceptr()
Actually, from the pgroup user forum: "We're still missing a few OpenACC features and device resident is one of them. We expect it to be in by the release"

30 OpenACC: Runtime Routines
- acc_get_num_devices()
- acc_set_device_type(), acc_get_device_type()
- acc_set_device_num(), acc_get_device_num()
- acc_async_wait(), acc_async_wait_all()
- acc_init(), acc_shutdown()
- acc_on_device()
- acc_malloc(), acc_free()
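A sketch of querying and initializing a device from Fortran (illustrative; uses the openacc module):

  program runtime_demo
    use openacc
    implicit none
    integer :: ndev
    ndev = acc_get_num_devices(acc_device_nvidia)
    print *, 'NVIDIA devices found:', ndev
    if (ndev > 0) then
       call acc_init(acc_device_nvidia)   ! pay the startup cost up front
    end if
  end program runtime_demo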

31 OpenACC: Environment Variables
- ACC_DEVICE_TYPE and ACC_DEVICE_NUM
- ACC_NOTIFY shows the list of launched kernels, with detailed information about: the number of the device that executes the kernel; the name of the function the kernel is launched from, the file that contains it and the line in the file; the grid and block dimensions
Example: export ACC_NOTIFY=1
file=.../testgpu.f90 function=testgpu line=18 device=0 grid=1x200 block=128 queue=0
- PGI_ACC_TIME is equivalent to the flag -ta=nvidia,time
Example: export PGI_ACC_TIME=1
18: region entered 1000 times
time(us): total=536,077 init=177 region=535,900 kernels=11,627 data=286,750
w/o init: total=535,900 max=716 min=519 avg=535

32 OpenACC: Mandatory requirements
- Privatize arrays (scalars are private by default)
  Error: Parallelization would require privatization of array a(:)
- All loops must be rectangular
- Restructure linearized arrays with computed indices
  Error: Non-stride-1 accesses for array b
- Privatize live-out scalars
  Accelerator restriction: induction variable live-out from loop
- No function calls in directive regions (manually or automatically inline subroutines)
  Error: Accelerator region ignored; Accelerator restriction: function/procedure calls are not supported
- Avoid print or write operations
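For instance (a hypothetical fragment), a temporary scalar must stay loop-local, or the compiler refuses to parallelize:

  subroutine squares(n, a, b)
    implicit none
    integer, intent(in)  :: n
    real(8), intent(in)  :: a(n)
    real(8), intent(out) :: b(n)
    integer :: i
    real(8) :: tmp
    ! tmp is privatized: if it were read after the loop ("live-out"),
    ! parallelization would be rejected with the errors listed above.
    !$acc parallel loop private(tmp) copyin(a) copyout(b)
    do i = 1, n
       tmp  = a(i)**2
       b(i) = tmp
    end do
  end subroutine squares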

33 OpenACC: Tips
- All parallel regions should contain a loop directive: always start with parallel loop (not just parallel)
- Always use : when shaping with an entire dimension (i.e. A(:,1:2))
- Watch for runtime device errors, for example:
  Call to cumemcpydtoh returned error: Launch failed
  Call to cumemcpy2d returned error: Invalid value
- First get your code working without data regions, then add data regions: be aware of data movement, and leave data on the GPU across procedure boundaries
- First get your code working without async, then add async
- Use directive clauses to optimize performance

34 OpenACC: Compilation flags
The minimum set of flags needed to use OpenACC directives is:
- -acc (enables OpenACC)
- -ta=nvidia (target)
- -ta=nvidia:cc20 (GPU capability; cc20 stands for compute capability 2.0)
- -ta=nvidia,time (enables Accelerator Kernel Timing data)
- -ta=nvidia,host (generates two versions of the routines, one that runs on the host and one on the GPU)
- -Minline -Mipa=fast,inline,reshape (enables IPA, automatic inlining and array reshaping)
- -O2 at least (if less, -Mipa forces -O2 anyway)
- -Minfo=inline,accel (enables compiler feedback)
Example: pgfortran -fast -O3 -acc -ta=host,nvidia:cc20,time -Minline -Mipa=fast,inline,reshape -Minfo=accel

35 OpenACC: Simple examples See live examples on PLX...

36 OpenACC: Performances
Example: Jacobi relaxation calculation, 4096 x 4096 mesh
- directory with source code on PLX: /plx/userexternal/lfranci0/stage/openacc/
- wiki page with all details and results
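A minimal OpenACC version of the Jacobi kernel, in the spirit of that example (a sketch, not the actual PLX source):

  subroutine jacobi(nx, ny, niter, u, unew)
    implicit none
    integer, intent(in)    :: nx, ny, niter
    real(8), intent(inout) :: u(nx, ny), unew(nx, ny)
    integer :: i, j, it
    ! One data region around the whole iteration: u stays on the GPU.
    !$acc data copy(u) create(unew)
    do it = 1, niter
       !$acc parallel loop collapse(2) present(u, unew)
       do j = 2, ny - 1
          do i = 2, nx - 1
             unew(i, j) = 0.25d0*(u(i-1, j) + u(i+1, j) + u(i, j-1) + u(i, j+1))
          end do
       end do
       !$acc parallel loop collapse(2) present(u, unew)
       do j = 2, ny - 1
          do i = 2, nx - 1
             u(i, j) = unew(i, j)
          end do
       end do
    end do
    !$acc end data
  end subroutine jacobi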

37 ECHO-GPU: Necessary rearrangements
Many changes and rearrangements were necessary to allow all subroutines in the evolution routine to be run as GPU kernels:
- it was mandatory to use at least the -O2 optimization flag, interprocedural analysis and automatic inlining and reshaping
- some small subroutines, or subroutines used only once, were manually inlined to better manage the variables
- many do loops were rearranged for synchronization purposes
- all the print statements inside parallel regions were removed
- a common statement was removed
- many exit statements were removed
- temporary arrays were replaced with scalars where possible
- many small do loops were manually unrolled
- many temporary arrays were privatized

38 ECHO-GPU: partial OpenACC implementation

39 ECHO-GPU: full OpenACC implementation
- only the primitive and conservative variables (together with the metric terms and the grid variables) are copied onto the device at the beginning of the evolution and copied back to the host at the end of the simulation
- all the other variables are created directly on the GPU with !$acc create or !$acc declare device_resident, and then declared present in all the subroutines called inside evolve
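Schematically (with hypothetical names, not ECHO's actual variables), the pattern is:

  subroutine run(n, nsteps, prim, cons)
    implicit none
    integer, intent(in)    :: n, nsteps
    real(8), intent(inout) :: prim(n), cons(n)
    real(8) :: flux(n)          ! temporary: lives only on the device
    integer :: i, it
    ! Copy in/out once, outside the whole evolution loop.
    !$acc data copy(prim, cons) create(flux)
    do it = 1, nsteps
       ! Inside the evolution, data are declared present: no transfers.
       !$acc parallel loop present(prim, cons, flux)
       do i = 1, n
          flux(i) = prim(i)     ! stand-in for the real flux computation
          cons(i) = cons(i) + flux(i)
       end do
    end do
    !$acc end data
  end subroutine run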

40 ECHO-GPU: Performances
Test run: 2D mesh, 120x50 points, tmax = 0.005 ms
Execution time:
- CPU version: ... sec
- GPU version: ... sec (already achieved)
- GPU version: 40 sec (theoretical but plausible value with a small further effort)
Speedup: achieved 1.5x, theoretical 2.3x
Real runs have much longer evolution times and finer grids, and in that case the performance is expected to be better

41 ECHO-GPU: Short-term improvements
Further improvements can be quite easily achieved by:
- extending the data region outside the main do while loop
- creating all the temporary variables directly in device memory, and copying only the primitive and conservative variable arrays together with the metric terms and grids
- avoiding the copy of some variables defined in inlined subfunctions
- using the new OpenACC features implemented in the PGI compiler release
- tuning the parallelization with the right choice of the number of threads and blocks

42 ECHO-GPU: Long-term improvements
Further medium- and long-term possible improvements include:
- implementing an OpenMP parallelization, taking advantage of the work already done to use OpenACC directives
- using MPI to manage multiple GPUs
- moving from cylindrical coordinates to Cartesian coordinates
- implementing a Python user interface
- implementing parallel HDF5 I/O

43 Thank you for your attention
Comments and suggestions are welcome, and enjoy your acceleration!
mail:
