
1 OpenACC and the Cray Compilation Environment. James Beyer, PhD

2 Agenda
- A brief introduction to OpenACC
- The Cray Programming Environment (PE) and Cray Compilation Environment (CCE)
- An in-depth look at CCE 8.2 and OpenACC
- A selection of insights concerning the use of OpenACC
- Summary

3 OpenACC review

4 Contents
- OpenACC programming model
- What does OpenACC look like?
- How are OpenACC directives used?

5 OpenACC programming model
- Host-directed execution with an attached GPU
  - Main program executes on the host (i.e. CPU) and directs execution on the device (i.e. GPU): memory allocation and transfers, kernel execution, synchronization
- Memory spaces on the host and device are distinct
  - Different locations, different address spaces
  - Data movement is performed by the host, using runtime library calls that explicitly move data between the separate memories (sketched below)
- GPUs have a weak memory model
  - No synchronization possible across the outermost parallel level
- User is responsible for:
  - Specifying code to run on the device
  - Specifying parallelism
  - Specifying data allocation/movement that spans single kernels
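
To make this division of labour concrete, here is a minimal Fortran sketch (the array names a and b and the size n are hypothetical, not from the slides): the host opens a data region that allocates device copies and moves data in, launches a kernel, and receives the result back when the region ends.

! Hypothetical example: host directs allocation, movement and launch
!$acc data copyin(a(1:n)) copyout(b(1:n))
!$acc parallel loop
DO i = 1, n
   b(i) = 2.0 * a(i)
ENDDO
!$acc end parallel loop
!$acc end data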

6 Accelerator directives
- Modify original source code with directives
  - Non-executable statements (comments, pragmas), so they can be ignored by a non-accelerating compiler
  - CCE: -hnoacc (or -xacc) also suppresses compilation of them
- Sentinel: acc
  - C/C++: preceded by #pragma; a structured block {...} avoids the need for end directives
  - Fortran: preceded by !$ (or c$ for FORTRAN77); usually paired with !$acc end *
  - Directives can be capitalised
- Continuation onto extra lines is allowed
  - C/C++: \ at the end of the line to be continued
  - Fortran fixed form: c$acc& or !$acc& on the continuation line
  - Fortran free form: & at the end of the line to be continued; continuation lines can start with either !$acc or !$acc&

// C/C++ example
#pragma acc *
{structured block}

! Fortran example
!$acc *
<structured block>
!$acc end *

7 A basic example
Execute a loop nest on the GPU; the compiler does the work:

!$acc parallel loop
DO i = 2,N-1
   c(i) = a(i) + b(i)   ! c(i) is write-only; a(i), b(i) are read-only
ENDDO
!$acc end parallel loop

- Data movement: allocates/frees GPU memory at the start/end of the region and moves data to/from the GPU
- Loop schedule: spreads loop iterations over the PEs of the GPU
  - OpenACC-to-CUDA mapping: gang = threadblock; worker = warp (group of 32 threads); vector = threads within a warp
  - The compiler takes care of cases where the iteration count doesn't divide the threadblock size
- Caching: explicitly use GPU shared memory for reused data
  - Automatic caching (e.g. NVIDIA Fermi, Kepler) is important
- Tune the default behavior with optional clauses on the directives (see the sketch below)
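
As an illustration of such tuning, a hedged sketch (the clause values below are illustrative only, not recommendations): num_gangs and vector_length override the compiler's default loop schedule.

!$acc parallel loop num_gangs(256) vector_length(128)
DO i = 2,N-1
   c(i) = a(i) + b(i)
ENDDO
!$acc end parallel loop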

8 Cray PE introduction

9 Cray packaged OpenACC Programming Environments
- Two different OpenACC compilers; you select these by loading a Programming Environment module
  - PrgEnv-cray for CCE (the default)
  - PrgEnv-pgi for PGI
- Once one of these is loaded, you can then select a compiler version
  - CCE: module avail cce
  - PGI: module avail pgi
  - Swap to the most up-to-date version in each case, e.g. "module avail cce" to see the versions available, then "module swap cce cce/<whatever>"
- For any GPU programming (CUDA, OpenCL, OpenACC...) make sure you always "module load craype-accel-nvidia35"
  - It is not loaded by default; the sys-admin decides

10 Using the compilers
- You use the compilers via wrapper commands: ftn for Fortran, cc for C, CC for C++
  - It doesn't matter which PrgEnv is loaded (same wrapper names)
  - The wrappers add optimisation options, architecture-specific settings and all the important library paths
  - Make sure the module xtpe-<processor type> is loaded so these are correct
  - In many cases you don't need any other compiler options
  - If you really want unoptimised code, you must use option -O0
- Further information
  - The man pages for the wrapper commands give general information
  - For more detail see the compiler-specific man pages (CCE: crayftn, craycc, crayCC; PGI: pgfortran, pgcc)
  - You will need the appropriate PrgEnv module loaded to see these

11 Some Cray Compilation Environment basics
CCE-specific features:
- Optimisation: -O2 is the default and you should usually use it; -O3 activates more aggressive options and could be faster or slower
- OpenMP: supported by default; if you don't want it, use the -hnoomp compiler flag
- OpenACC: enabled automatically when the accelerator module is loaded
- CCE only gives minimal information to stderr when compiling; use -hmsgs to see more
- For more information, request a compiler listing file
  - The flag -hlist=a for ftn and cc writes a file with extension .lst
  - It contains an annotated source listing, followed by explanatory messages
  - Each message is tagged with an identifier, e.g. ftn-6430; to get more information on it, type: explain <identifier>
- For a full description of the Cray compilers, see the reference manuals

12 Further information: Compiling CUDA
- Compilation: module load craype-accel-nvidia35
- Main CPU code compiled with the PrgEnv "cc" wrapper: either PrgEnv-gnu for gcc, or PrgEnv-cray for craycc
- GPU CUDA-C kernels compiled with nvcc: nvcc -O3 -arch=sm_35
- The PrgEnv "cc" wrapper is used for linking
  - Only GPU flag needed: -lcudart
  - e.g. no CUDA -L flags needed (added by the cc wrapper)

13 CCE 8.2 OpenACC status

14 Contents
- Cray Compilation Environment (CCE)
- What does CCE do with X?
- -hacc_model=

15 OpenACC in CCE
- man intro_openacc
- Which module to use: craype-accel-nvidia20 or craype-accel-nvidia35
  - Forces dynamic linking
- Single object file
- Whole program
- Messages/list file
- Compiles to PTX, not CUDA
- Leverages years of vector code generator experience
- Debugger sees the original program, not the CUDA intermediate

16 OpenACC implementation status
- OpenACC 1.0 features: complete
- _OPENACC change: complete
- default(none): complete
- acc_async_sync and acc_async_noval: complete
- Loop nesting clarification: matches what we have always done
- wait clause on parallel, kernels and update: complete
- async clause on wait directive: complete
- enter / exit data: complete
- Common block names: deferred
- link clause: complete
- Multidimensional C/C++ array support: complete
- tile clause: complete/deferred
- auto clause: complete
- device_type: complete
- routine directive: complete
- Nested parallelism: deferred
- Atomic constructs: complete
- New APIs: complete

17 What does CCE do with OpenACC constructs (1)
- Parallel/kernels
  - Flattens all calls that do not have routine constructs on them
  - Packages code for the kernel
  - Inserts data motion to and from the device (from clauses, or autodetected)
  - Inserts kernel launch code
  - Automatic vectorization is enabled
  - Inserts joins/events for wait clauses
- Kernels
  - Identifies kernels
  - Inserts joins/events for wait clauses
- Loop
  - gang = thread block (TB); worker = warp; vector = threads within a warp or TB
  - Automatic vectorization is enabled
  - collapse: will only rediscover indices when required (see the sketch below)
  - independent: turns off safety/correctness checking for work-sharing of the loop
  - reduction: nontrivial to implement; does not use multiple kernels; all loop directives within a loop nest must list the reduction, if applicable
  - tile: similar to collapse
  - auto: treated as a preferred clause for our auto-parallelism feature
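
For instance, a minimal Fortran sketch of the collapse and independent clauses (loop bounds and arrays hypothetical): collapse(2) merges the two loop indices into one iteration space, and independent asserts that the iterations are safe to work-share.

!$acc parallel
!$acc loop collapse(2) independent
DO j = 1, M
   DO i = 1, N
      a(i,j) = b(i,j) + c(i,j)
   ENDDO
ENDDO
!$acc end parallel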

18 What does CCE do with OpenACC constructs (2)
- Data clause( object list )
  - create: allocate at start, register in the present-table, de-allocate at exit
  - copy, copyin, copyout: create plus a data copy
  - present: abort at runtime if the object is not in the present table
  - present_or_copy, present_or_copyin, present_or_copyout, present_or_create
  - deviceptr: send the address directly to the kernel without translation
- Unstructured data (sketched below)
  - enter data: same as the init part of a data construct
  - exit data: delete the object from the present table; abort at runtime if the object is not on the device
- Update
  - Implicit !$acc data present( obj )
  - Known contiguous memory: transfer (essentially a CUDA memcpy)
  - Non-contiguous memory: pack into a contiguous buffer, transfer the contiguous buffer, unpack from it
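
A hedged sketch of the unstructured data directives and update in Fortran (the array a and size n are hypothetical):

!$acc enter data copyin(a(1:n))   ! init part of a data construct: allocate and copy in
!$acc parallel loop present(a)
DO i = 1, n
   a(i) = a(i) + 1.0
ENDDO
!$acc end parallel loop
!$acc update host(a(1:n))         ! contiguous memory: a single transfer
!$acc exit data delete(a)         ! remove a from the present table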

19 What does CCE do with OpenACC constructs (3)
- Cache
  - Creates shared-memory copies of objects
  - Generates copies into and out of the shared-memory objects
  - Releases the shared memory
- Routine construct (sketched below)
  - gang: generate gang-redundant code
  - worker: generate worker-single code
  - vector: generate vector-single code
  - seq: generate per-thread code
  - bind( name ) / bind( string )
  - If-block with acc_on_device
  - nohost is ignored
- Declare construct
  - Implementation completely reworked for the 8.2 release
  - link: create a pointer for the object on the device; replace all references to the object in kernels with pointer-based references (similar to PIC code); adds fixup code to ensure that device pointers contain the correct address after the object is moved to the device
  - device_resident: places the object on the device; Fortran allocatables not complete
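
For the routine construct, a minimal Fortran sketch (the subroutine name and arguments are hypothetical): routine vector requests vector-single code, so the routine can be called from a gang/worker context with its inner loop spread across vector lanes.

SUBROUTINE scale(v, n)
!$acc routine vector
   INTEGER :: n, i
   REAL :: v(n)
!$acc loop vector
   DO i = 1, n
      v(i) = 2.0 * v(i)
   ENDDO
END SUBROUTINE scale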

20 What does CCE do with OpenACC constructs (4)
- Atomic construct (sketched below)
  - Maps onto our OpenMP translation system
  - CAS loops for unsupported operators
  - Compiler issues an error if the type requires locks, e.g. complex(128)
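
A minimal sketch of an atomic update in Fortran (the histogram arrays hist and bin are hypothetical): the atomic directive protects the read-modify-write when different iterations may hit the same element.

!$acc parallel loop
DO i = 1, n
!$acc atomic update
   hist(bin(i)) = hist(bin(i)) + 1
ENDDO
!$acc end parallel loop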

21 Extended OpenACC 2.0 runtime routines

void cray_acc_update_device_async( void *, size_t, int );
void cray_acc_update_host_async( void *, size_t, int );
void *cray_acc_memcpy_to_host_async( void *destination, const void *source, size_t size, int async_id );
void *cray_acc_memcpy_to_device_async( void *destination, const void *source, size_t size, int async_id );

22 Partitioning clause mappings

1. !$acc loop gang : across thread blocks
2. !$acc loop worker : across warps within a thread block
3. !$acc loop vector : across threads within a warp

1. !$acc loop gang : across thread blocks
2. !$acc loop worker vector : across threads within a thread block

1. !$acc loop gang : across thread blocks
2. !$acc loop vector : same as worker vector

1. !$acc loop gang worker : across thread blocks and the warps within a thread block
2. !$acc loop vector : across threads within a warp

1. !$acc loop gang vector : across thread blocks and threads within a thread block

1. !$acc loop gang worker vector : same as gang vector

23 Partitioning clause mappings (cont)
You can also force things to be within a single thread block:

1. !$acc loop worker : across warps within a single thread block
2. !$acc loop vector : across threads within a warp

1. !$acc worker vector : across threads within a single thread block

1. !$acc vector : across threads within a single thread block
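
Putting the mappings together, a hedged sketch of an explicit three-level partitioning of a loop nest (the bounds and arrays are hypothetical): the outer loop is spread across thread blocks, the middle loop across warps, and the inner loop across threads within a warp.

!$acc parallel
!$acc loop gang
DO k = 1, NZ
!$acc loop worker
   DO j = 1, NY
!$acc loop vector
      DO i = 1, NX
         a(i,j,k) = b(i,j,k) + c(i,j,k)
      ENDDO
   ENDDO
ENDDO
!$acc end parallel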

24 -hacc_model options
- auto_async_(none|kernel|all)
  - Compiler automatically adds some asynchronous behavior
  - Only overlaps host and accelerator; no automatic overlap of different accelerator constructs (single stream)
  - May require some explicit user waits
- host_data
- [no_]fast_addr
  - Uses 32-bit variables/calculations for index expressions
  - Faster address computation, fewer registers
- [no_]deep_copy
  - Enables automatic deep-copy support

25 OpenACC insights

26 parallel vs. kernels
- parallel and kernels regions look very similar
  - Both define a region to be accelerated
  - Different heritage; different levels of obligation for the compiler
- parallel is prescriptive (like the OpenMP programming model)
  - Uses a single accelerator kernel to accelerate the region
  - The compiler will accelerate the region (even if this leads to incorrect results)
- kernels is descriptive (like the PGI Accelerator programming model)
  - Uses one or more accelerator kernels to accelerate the region
  - The compiler may accelerate the region (if it decides the loop iterations are independent)
- Which to use (my opinion)
  - parallel (or parallel loop) offers greater control and fits better with the OpenMP model
  - kernels (or kernels loop) is better for initially exploring parallelism; not knowing if a loopnest is accelerated could be a problem
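
To see the two models side by side, a minimal sketch of the same loop under each (the arrays are hypothetical):

!$acc kernels            ! descriptive: compiler decides whether to parallelise
DO i = 1, n
   a(i) = b(i) + c(i)
ENDDO
!$acc end kernels

!$acc parallel loop      ! prescriptive: compiler accelerates as instructed
DO i = 1, n
   a(i) = b(i) + c(i)
ENDDO
!$acc end parallel loop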

27 parallel loop vs. parallel and loop
- A parallel region can span multiple code blocks, i.e. sections of serial code statements and/or loopnests
- Loopnests in a parallel region are not automatically partitioned; you need to explicitly use the loop directive for this to happen
- Scalar code (serial code, loopnests without a loop directive) is executed redundantly, i.e. identically by every thread (or maybe just by one thread per block; it's implementation-dependent)
- There is no synchronisation between redundant code or kernels
  - Offers potential for overlap of execution on the GPU
  - Also offers potential (and likelihood) of race conditions and incorrect code
- There is no mechanism for a barrier inside a parallel region
  - After all, CUDA offers no barrier on the GPU across threadblocks
  - To effect a barrier, end the parallel region and start a new one (see the sketch below)
  - Also use a wait directive outside the parallel region for extra safety
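
A hedged sketch of effecting a barrier by splitting regions (the arrays are hypothetical): the second loop depends on the first, so it goes in a new region, with a wait directive between them for extra safety.

!$acc parallel loop
DO i = 1, n
   a(i) = b(i) + c(i)
ENDDO
!$acc end parallel loop
!$acc wait
!$acc parallel loop
DO i = 1, n
   d(i) = 2.0 * a(i)
ENDDO
!$acc end parallel loop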

28 parallel loop vs. parallel and loop
Some advice: don't...
- GPU threads are very lightweight (unlike OpenMP), so don't worry about having extra parallel regions
- Explicit use of the async clause may achieve the same results as using one parallel region, but with greater code clarity and better control over overlap
... but if you feel you must:
- Begin with the composite parallel loop and get correct code
- Separate the directives with care, and only as a later performance tuning, when you are sure the kernels are independent and there are no race conditions

29 parallel loop vs. parallel and loop: when you actually might want to
You might split the directive if you have a single loopnest and you need explicit control over the loop scheduling:
- You do this with multiple loop directives inside the parallel region
- Or you could use parallel loop for the outermost loop, and loop for the others

But beware of reduction variables. With separate loop directives, you need a reduction clause on every loop directive that includes a reduction, at least with CCE:

Correct:
t = 0
!$acc parallel loop &
!$acc reduction(+:t)
DO j = 1,N
   DO i = 1,N
      t = t + a(i,j)
   ENDDO
ENDDO
!$acc end parallel loop

Wrong:
t = 0
!$acc parallel &
!$acc reduction(+:t)
!$acc loop
DO j = 1,N
!$acc loop
   DO i = 1,N
      t = t + a(i,j)
   ENDDO
ENDDO
!$acc end parallel

Wrong:
t = 0
!$acc parallel
!$acc loop reduction(+:t)
DO j = 1,N
!$acc loop
   DO i = 1,N
      t = t + a(i,j)
   ENDDO
ENDDO
!$acc end parallel

Correct:
t = 0
!$acc parallel
!$acc loop reduction(+:t)
DO j = 1,N
!$acc loop reduction(+:t)
   DO i = 1,N
      t = t + a(i,j)
   ENDDO
ENDDO
!$acc end parallel

30 parallel gotchas
- No loop directive
  - The code will (or may) run redundantly: every thread does every loop iteration
  - Not usually what we want

!$acc parallel
DO i = 1,N
   a(i) = b(i) + c(i)
ENDDO
!$acc end parallel

- Serial code in a parallel region: avoids copyin(t), but a good idea? No!
  - Every thread sets t=0
  - Asynchronicity: no guarantee this finishes before the loop kernel starts
  - Race condition, unstable answers

!$acc parallel
t = 0
!$acc loop reduction(+:t)
DO i = 1,N
   t = t + a(i)
ENDDO
!$acc end parallel

- Multiple kernels: again, a potential race condition
  - Treat OpenACC "end loop" like OpenMP "enddo nowait"

!$acc parallel
!$acc loop
DO i = 1,N
   a(i) = 2*a(i)
ENDDO
!$acc loop
DO i = 1,N
   a(i) = a(i) + 1
ENDDO
!$acc end parallel

31 Declare link

int a[100000];
#pragma acc declare link(a)

int main() {
  #pragma acc parallel loop
  for( int i = 0; i < 100000; i++ ) { /* ... */ }
}

int a[100000];
#pragma acc declare link(a)

#pragma acc routine gang
void foo() {
  #pragma acc loop gang worker vector
  for( int i = 0; i < 100000; i++ ) { /* ... */ }
}

int main() {
  #pragma acc parallel copy(a)
  foo();
}


33 Summary
- The Cray Programming Environment support for OpenACC was introduced
- An in-depth look at OpenACC support in CCE was presented
- A few insights gained while implementing and working with OpenACC were presented
Final thoughts: there is still work to do, in CCE and in OpenACC

34 Upcoming GTC Express Webinars
- November 20: Improving Performance using the CUDA Memory Model and Features of the Kepler Architecture
- November 21: Speeding Up Financial Risk Management Cost Efficiently for Intra-day and Pre-deal CVA Calculations
- December 3: CUDA Tools for Optimal Performance and Productivity
- December 12: GPU-accelerated High Performance Geospatial Line-of-sight Calculations
Register at

35 GTC 2014 Call for Posters Open
Posters should describe novel or interesting topics in:
- Science and research
- Professional graphics
- Mobile computing
- Automotive applications
- Game development
- Cloud computing
Submit for a chance to win the Best Poster Award
