OpenACC and the Cray Compilation Environment James Beyer PhD
2 Agenda
- A brief introduction to OpenACC
- Cray Programming Environment (PE) and Cray Compilation Environment (CCE)
- An in-depth look at CCE 8.2 and OpenACC
- A selection of insights concerning the use of OpenACC
- Summary
3 OpenACC review
4 Contents
- OpenACC programming model
- What does OpenACC look like?
- How are OpenACC directives used?
5 OpenACC programming model
- Host-directed execution with an attached GPU
  - The main program executes on the host (i.e. CPU) and directs execution on the device (i.e. GPU): memory allocation and transfers, kernel execution, synchronization
- Memory spaces on the host and device are distinct
  - Different locations, different address spaces
  - Data movement is performed by the host, using runtime library calls that explicitly move data between the separate memories
- GPUs have a weak memory model
  - No synchronization is possible between units at the outermost parallel level
- The user is responsible for:
  - Specifying code to run on the device
  - Specifying parallelism
  - Specifying data allocation/movement that spans single kernels
6 Accelerator directives
- Modify the original source code with directives
  - Non-executable statements (comments, pragmas)
  - Can be ignored by a non-accelerating compiler; CCE -hnoacc (or -xacc) also suppresses compilation
- Sentinel: acc
  - C/C++: preceded by #pragma; a structured block {...} avoids the need for end directives
  - Fortran: preceded by !$ (or c$ for FORTRAN77); usually paired with !$acc end *
  - Directives can be capitalised
- Continuation onto extra lines is allowed
  - C/C++: \ at the end of the line to be continued
  - Fortran fixed form: c$acc& or !$acc& on the continuation line
  - Fortran free form: & at the end of the line to be continued; continuation lines can start with either !$acc or !$acc&

  // C/C++ example
  #pragma acc *
  {structured block}

  ! Fortran example
  !$acc *
  <structured block>
  !$acc end *
7 A basic example
- Execute a loop nest on the GPU; the compiler does the work:

  !$acc parallel loop
  DO i = 2,N-1
    c(i) = a(i) + b(i)
  ENDDO
  !$acc end parallel loop

- Data movement
  - Allocates/frees GPU memory at the start/end of the region
  - Moves data to/from the GPU (here a and b are read-only, c is write-only)
- Loop schedule: spreading loop iterations over the PEs of the GPU
  - OpenACC to CUDA mapping: gang = threadblock; worker = warp (group of 32 threads); vector = threads within a warp
  - The compiler takes care of cases where the iteration count doesn't divide the threadblock size
- Caching (explicitly use GPU shared memory for reused data)
  - Automatic caching (e.g. NVIDIA Fermi, Kepler) is important
- Tune the default behavior with optional clauses on the directives
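The same pattern can be written in C/C++; the following is an illustrative sketch (not from the slides), operating on the interior points as in the Fortran DO i = 2,N-1 loop. Because the directives are comments/pragmas, a non-accelerating compiler ignores them and runs the loop serially with identical results.

```c
/* C/C++ version of the basic parallel loop example above (a sketch).
 * With PrgEnv-cray this would be compiled with the "cc" wrapper; a serial
 * compiler simply ignores the pragma. */
void vec_add(int n, const double *a, const double *b, double *c)
{
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 1; i < n - 1; i++)   /* interior points only */
        c[i] = a[i] + b[i];
}
```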
8 Cray PE introduction
9 Cray-packaged OpenACC Programming Environments
- Two different OpenACC compilers; you select these by loading a Programming Environment module
  - PrgEnv-cray for CCE (the default)
  - PrgEnv-pgi for PGI
- Once one of these is loaded, you can then select a compiler version
  - CCE: "module avail cce"; PGI: "module avail pgi"
  - Swap to the most up-to-date version in each case, e.g. "module avail cce" to see the versions available, then "module swap cce cce/<whatever>"
- For any GPU programming (CUDA, OpenCL, OpenACC...) make sure you always "module load craype-accel-nvidia35"
  - It is not loaded by default; the sys-admin decides
10 Using the compilers
- You use the compilers via wrapper commands: ftn for Fortran; cc for C; CC for C++
  - It doesn't matter which PrgEnv is loaded (same wrapper names)
  - The wrappers add optimisation options, architecture-specific flags and all the important library paths
  - Make sure the xtpe-<processor type> module is loaded so these are correct
  - In many cases you don't need any other compiler options
  - If you really want unoptimised code, you must use option -O0
- Further information
  - The man pages for the wrapper commands give you general information
  - For more detail, see the compiler-specific man pages (CCE: crayftn, craycc, crayCC; PGI: pgfortran, pgcc)
  - You will need the appropriate PrgEnv module loaded to see these
11 Some Cray Compilation Environment basics
CCE-specific features:
- Optimisation: -O2 is the default and you should usually use this
  - -O3 activates more aggressive options; could be faster or slower
- OpenMP is supported by default; if you don't want it, use the -hnoomp compiler flag
- OpenACC is enabled automatically when the accelerator module is loaded
- CCE only gives minimal information to stderr when compiling (-hmsgs to see more)
- For more information, you should request a compiler listing file
  - Flag -hlist=a for ftn and cc writes a file with extension .lst
  - It contains an annotated source listing, followed by explanatory messages
  - Each message is tagged with an identifier, e.g. ftn-6430; to get more information on this, type: explain <identifier>
- For a full description of the Cray compilers, see the reference manuals
12 Further information: compiling CUDA
- Compilation: module load craype-accel-nvidia35
- Main CPU code is compiled with the PrgEnv "cc" wrapper
  - Either PrgEnv-gnu for gcc, or PrgEnv-cray for craycc
- GPU CUDA-C kernels are compiled with nvcc, e.g. nvcc -O3 -arch=sm_35
- The PrgEnv "cc" wrapper is used for linking
  - The only GPU flag needed is -lcudart; e.g. no CUDA -L flags are needed (the cc wrapper adds them)
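The build steps above might be captured in a makefile along these lines. This is a sketch under assumptions: the file names (main.c, kernels.cu) and the host/kernel split are hypothetical, not from the slides.

```makefile
# Hypothetical layout: main.c = host code, kernels.cu = CUDA-C kernels
all: app

kernels.o: kernels.cu
	nvcc -O3 -arch=sm_35 -c kernels.cu

main.o: main.c
	cc -c main.c                    # PrgEnv "cc" wrapper

app: main.o kernels.o
	cc main.o kernels.o -lcudart    # wrapper supplies the CUDA -L paths
```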
13 CCE 8.2 OpenACC status
14 Contents
- Cray Compilation Environment (CCE)
- What does CCE do with X?
- -hacc_model=
15 OpenACC in CCE
- man intro_openacc
- Which module to use: craype-accel-nvidia20 or craype-accel-nvidia35
  - Forces dynamic linking
- Single object file; whole-program view; messages/list file
- Compiles to PTX, not CUDA
  - Leverages years of vector code generator experience
  - The debugger sees the original program, not a CUDA intermediate
16 OpenACC implementation status
- OpenACC 1.0 features: complete
- _OPENACC change: complete
- default(none): complete
- acc_async_sync and acc_async_noval: complete
- Loop nesting clarification: matches what we have always done
- wait clause on parallel, kernels and update: complete
- async clause on the wait directive: complete
- enter/exit data: complete
- Common block names: deferred
- link clause: complete
- Multidimensional C/C++ array support: complete
- tile clause: complete/deferred
- auto clause: complete
- device_type: complete
- routine directive: complete
- Nested parallelism: deferred
- Atomic constructs: complete
- New APIs: complete
17 What does CCE do with OpenACC constructs (1)
- parallel/kernels
  - Flattens all calls that do not have routine constructs on them
  - Packages code for the kernel
  - Inserts data motion to and from the device (from clauses, plus autodetection)
  - Inserts kernel launch code
  - Automatic vectorization is enabled
  - Inserts joins/events for wait clauses
- kernels
  - Identifies kernels
  - Inserts joins/events for wait clauses
- loop
  - gang: thread block (TB); worker: warp; vector: threads within a warp or TB
  - Automatic vectorization is enabled
  - collapse: will only rediscover indices when required
  - independent: turns off safety/correctness checking for work-sharing of the loop
  - reduction: nontrivial to implement; does not use multiple kernels; all loop directives within a loop nest must list the reduction if applicable
  - tile: similar to collapse
  - auto: treated as a preferred clause for our auto-parallelism feature
18 What does CCE do with OpenACC constructs (2)
- data clause( object list )
  - create: allocate at start, register in the present-table, de-allocate at exit
  - copy, copyin, copyout: create plus a data copy
  - present: abort at runtime if the object is not in the present table
  - present_or_copy, present_or_copyin, present_or_copyout, present_or_create
  - deviceptr: send the address directly to the kernel without translation
- Unstructured data
  - enter data: same as the init part of a data construct
  - exit data: delete the object from the present table; abort at runtime if the object is not on the device
- update
  - Implicit !$acc data present( obj )
  - For known-contiguous memory: transfer (essentially a CUDA memcpy)
  - For non-contiguous memory: pack into a contiguous buffer, transfer the contiguous buffer, unpack from it
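The structured data construct and update directive described above can be sketched in C as follows (an illustrative example, not from the slides): the data region creates device copies that outlive two kernels, and update refreshes the host copy in between. With the pragmas ignored by a serial compiler the result is unchanged.

```c
/* Sketch of data-clause behaviour: copy(a) allocates and copies in at
 * region entry and copies out at exit; present(a) asserts the object is
 * already in the present table; update self(a) refreshes the host copy. */
void scale_and_shift(int n, double *a, double s, double d)
{
    #pragma acc data copy(a[0:n])
    {
        #pragma acc parallel loop present(a[0:n])
        for (int i = 0; i < n; i++)
            a[i] *= s;

        #pragma acc update self(a[0:n])   /* host copy now valid mid-region */

        #pragma acc parallel loop present(a[0:n])
        for (int i = 0; i < n; i++)
            a[i] += d;
    }
}
```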
19 What does CCE do with OpenACC constructs (3)
- cache
  - Creates shared-memory copies of objects
  - Generates copies into and out of the shared-memory objects
  - Releases the shared memory
- routine construct
  - gang: generates gang-redundant code
  - worker: generates worker-single code
  - vector: generates vector-single code
  - seq: generates per-thread code
  - bind( name ) / bind( string ): an if-block with acc_on_device
  - nohost is ignored
- declare construct (implementation completely reworked for the 8.2 release)
  - link: creates a pointer for the object on the device and replaces all references to the object in kernels with pointer-based references (similar to PIC code); adds fixup code to ensure that device pointers contain the correct address after the object is moved to the device
  - device_resident: places the object on the device; Fortran allocatables not complete
20 What does CCE do with OpenACC constructs (4)
- atomic construct
  - Maps onto our OpenMP translation system
  - CAS loops for unsupported operators
  - The compiler issues an error if the type requires locks, e.g. complex(128)
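A minimal sketch of the atomic construct in use (an illustrative example, not from the slides): a histogram where many loop iterations may update the same bin, so each increment is made atomic. A serial build, with the pragmas ignored, produces the same counts.

```c
/* Histogram with atomic bin updates.  Without "atomic update", concurrent
 * gangs/workers could lose increments when keys collide. */
void histogram(int n, const int *key, int nbins, int *bins)
{
    #pragma acc parallel loop copyin(key[0:n]) copy(bins[0:nbins])
    for (int i = 0; i < n; i++) {
        #pragma acc atomic update
        bins[key[i] % nbins]++;
    }
}
```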
21 Extended OpenACC 2.0 runtime routines

  void cray_acc_update_device_async( void *, size_t, int );
  void cray_acc_update_host_async( void *, size_t, int );
  void *cray_acc_memcpy_to_host_async( void *destination, const void *source, size_t size, int async_id );
  void *cray_acc_memcpy_to_device_async( void *destination, const void *source, size_t size, int async_id );
22 Partitioning clause mappings
Three-level schedule:
  1. !$acc loop gang : across thread blocks
  2. !$acc loop worker : across warps within a thread block
  3. !$acc loop vector : across threads within a warp
Two-level schedules:
  1. !$acc loop gang : across thread blocks
  2. !$acc loop worker vector : across threads within a thread block
or
  1. !$acc loop gang : across thread blocks
  2. !$acc loop vector : same as worker vector
or
  1. !$acc loop gang worker : across thread blocks and the warps within a thread block
  2. !$acc loop vector : across threads within a warp
One-level schedules:
  1. !$acc loop gang vector : across thread blocks and threads within a thread block
  1. !$acc loop gang worker vector : same as gang vector
23 Partitioning clause mappings (cont)
You can also force things to be within a single thread block:
  1. !$acc loop worker : across warps within a single thread block
  2. !$acc loop vector : across threads within a warp
or
  1. !$acc worker vector : across threads within a single thread block
or
  1. !$acc vector : across threads within a single thread block
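A two-level schedule from the mappings above might look like this in C (an illustrative sketch, not from the slides): gang over the rows, vector with a reduction over the dot product. A serial compiler ignores the clauses and computes the same result.

```c
/* Matrix-vector product: outer loop across thread blocks (gang),
 * inner loop across threads within a block (vector), with a per-row
 * reduction.  A is n*n, row-major. */
void matvec(int n, const double *A, const double *x, double *y)
{
    #pragma acc parallel loop gang copyin(A[0:n*n], x[0:n]) copyout(y[0:n])
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        #pragma acc loop vector reduction(+:sum)
        for (int j = 0; j < n; j++)
            sum += A[i*n + j] * x[j];
        y[i] = sum;
    }
}
```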
24 -hacc_model options
- auto_async_(none|kernel|all)
  - The compiler automatically adds some asynchronous behavior
  - Only overlaps host and accelerator; no automatic overlap of different accelerator constructs (single stream)
  - May require some explicit user waits
- host_data
- [no_]fast_addr
  - Uses 32-bit variables/calculations for index expressions
  - Faster address computation; fewer registers
- [no_]deep_copy
  - Enables automatic deep-copy support
25 OpenACC insights
26 parallel vs. kernels
- parallel and kernels regions look very similar
  - Both define a region to be accelerated
  - Different heritage; different levels of obligation for the compiler
- parallel: prescriptive (like the OpenMP programming model)
  - Uses a single accelerator kernel to accelerate the region
  - The compiler will accelerate the region (even if this leads to incorrect results)
- kernels: descriptive (like the PGI Accelerator programming model)
  - Uses one or more accelerator kernels to accelerate the region
  - The compiler may accelerate the region (if it decides the loop iterations are independent)
- Which to use (my opinion)
  - parallel (or parallel loop) offers greater control and fits better with the OpenMP model
  - kernels (or kernels loop) is better for initially exploring parallelism, but not knowing whether a loopnest is accelerated could be a problem
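The contrast can be sketched in C (an illustrative example, not from the slides). Both versions compute the same SAXPY; the difference is in the obligation: with parallel loop the programmer asserts the iterations are independent, while with kernels the compiler must convince itself before generating a kernel.

```c
/* Prescriptive: the compiler WILL parallelize, trusting the programmer. */
void saxpy_parallel(int n, float a, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Descriptive: the compiler MAY parallelize, if it proves independence. */
void saxpy_kernels(int n, float a, const float *x, float *y)
{
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```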
27 parallel loop vs. parallel and loop
- A parallel region can span multiple code blocks, i.e. sections of serial code statements and/or loopnests
- Loopnests in a parallel region are not automatically partitioned
  - You need to explicitly use the loop directive for this to happen
- Scalar code (serial code, loopnests without a loop directive) is executed redundantly, i.e. identically by every thread
  - Or maybe just by one thread per block (it's implementation-dependent)
- There is no synchronisation between redundant code or kernels
  - This offers potential for overlap of execution on the GPU
  - It also offers potential (and likelihood) of race conditions and incorrect code
- There is no mechanism for a barrier inside a parallel region
  - After all, CUDA offers no barrier on the GPU across threadblocks
  - To effect a barrier, end the parallel region and start a new one
  - Also use a wait directive outside the parallel region for extra safety
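The "end the region to get a barrier" advice can be sketched as follows (an illustrative C example, not from the slides): the second parallel region cannot start reading a until the first region writing it has completed, because the regions are separate and synchronous.

```c
/* Two-phase computation with an inter-phase "barrier" obtained by ending
 * the first parallel region before starting the second. */
double sum_of_squares(int n, double *a)
{
    double s = 0.0;

    #pragma acc parallel loop copy(a[0:n])
    for (int i = 0; i < n; i++)
        a[i] = a[i] * a[i];

    /* Region boundary: phase 1 is complete before phase 2 begins. */

    #pragma acc parallel loop reduction(+:s) copyin(a[0:n])
    for (int i = 0; i < n; i++)
        s += a[i];

    return s;
}
```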
28 parallel loop vs. parallel and loop
Some advice: don't...
- GPU threads are very lightweight (unlike OpenMP), so don't worry about having extra parallel regions
- Explicit use of the async clause may achieve the same results as using one parallel region, but with greater code clarity and better control over overlap
... but if you feel you must:
- Begin with the composite parallel loop and get correct code
- Separate the directives with care, only as a later performance tuning, when you are sure the kernels are independent and there are no race conditions
29 parallel loop vs. parallel and loop: when you actually might want to
You might split the directive if you have a single loopnest and you need explicit control over the loop scheduling:
- You do this with multiple loop directives inside the parallel region
- Or you could use parallel loop for the outermost loop, and loop for the others
But beware of reduction variables. With separate loop directives, you need a reduction clause on every loop directive that includes a reduction, at least with CCE:

  ! Correct!
  t = 0
  !$acc parallel loop &
  !$acc reduction(+:t)
  DO j = 1,N
    DO i = 1,N
      t = t + a(i,j)
    ENDDO
  ENDDO
  !$acc end parallel loop

  ! Wrong!
  t = 0
  !$acc parallel &
  !$acc reduction(+:t)
  !$acc loop
  DO j = 1,N
    !$acc loop
    DO i = 1,N
      t = t + a(i,j)
    ENDDO
  ENDDO
  !$acc end parallel

  ! Wrong!
  t = 0
  !$acc parallel
  !$acc loop reduction(+:t)
  DO j = 1,N
    !$acc loop
    DO i = 1,N
      t = t + a(i,j)
    ENDDO
  ENDDO
  !$acc end parallel

  ! Correct!
  t = 0
  !$acc parallel
  !$acc loop reduction(+:t)
  DO j = 1,N
    !$acc loop reduction(+:t)
    DO i = 1,N
      t = t + a(i,j)
    ENDDO
  ENDDO
  !$acc end parallel
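The last "Correct!" variant transcribed to C, as an illustrative sketch (the slides show it in Fortran): note the reduction clause on both loop directives, as the slide requires for CCE.

```c
/* Nested 2-D sum with the reduction clause repeated on every loop
 * directive that participates in the reduction.  a is n*n, row-major. */
double sum2d(int n, const double *a)
{
    double t = 0.0;
    #pragma acc parallel copyin(a[0:n*n])
    {
        #pragma acc loop reduction(+:t)
        for (int j = 0; j < n; j++) {
            #pragma acc loop reduction(+:t)
            for (int i = 0; i < n; i++)
                t += a[j*n + i];
        }
    }
    return t;
}
```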
30 parallel gotchas
- No loop directive: the code will (or may) run redundantly
  - Every thread does every loop iteration; not usually what we want

  !$acc parallel
  DO i = 1,N
    a(i) = b(i) + c(i)
  ENDDO
  !$acc end parallel

- Serial code in a parallel region: avoids copyin(t), but a good idea? No!
  - Every thread sets t=0
  - Asynchronicity: no guarantee this finishes before the loop kernel starts
  - Race condition, unstable answers

  !$acc parallel
  t = 0
  !$acc loop reduction(+:t)
  DO i = 1,N
    t = t + a(i)
  ENDDO
  !$acc end parallel

- Multiple kernels: again, a potential race condition
  - Treat OpenACC "end loop" like OpenMP "enddo nowait"

  !$acc parallel
  !$acc loop
  DO i = 1,N
    a(i) = 2*a(i)
  ENDDO
  !$acc loop
  DO i = 1,N
    a(i) = a(i) + 1
  ENDDO
  !$acc end parallel
31 Declare link

  int a[100000];
  #pragma acc declare link(a)

  int main() {
    #pragma acc parallel loop
    for( int i = 0; i < 100000; i++ ) {
      ...
    }
  }

  int a[100000];
  #pragma acc declare link(a)

  #pragma acc routine gang
  void foo() {
    #pragma acc loop gang worker vector
    for( int i = 0; i < 100000; i++ ) {
      ...
    }
  }

  int main() {
    #pragma acc parallel copy(a)
    foo();
  }
33 Summary
- The Cray Programming Environment support for OpenACC was introduced
- An in-depth look at OpenACC support in CCE was presented
- A few insights gained while implementing and working with OpenACC were presented
Final thoughts: there is still work to do, in CCE and in OpenACC
34 Upcoming GTC Express Webinars
- November 20: Improving Performance using the CUDA Memory Model and Features of the Kepler Architecture
- November 21: Speeding Up Financial Risk Management Cost Efficiently for Intra-day and Pre-deal CVA Calculations
- December 3: CUDA Tools for Optimal Performance and Productivity
- December 12: GPU-accelerated High Performance Geospatial Line-of-sight Calculations
35 GTC 2014 Call for Posters Open
Posters should describe novel or interesting topics in:
- Science and research
- Professional graphics
- Mobile computing
- Automotive applications
- Game development
- Cloud computing
Submit for a chance to win the Best Poster Award
More informationOpenACC Fundamentals. Steve Abbott November 13, 2016
OpenACC Fundamentals Steve Abbott , November 13, 2016 Who Am I? 2005 B.S. Physics Beloit College 2007 M.S. Physics University of Florida 2015 Ph.D. Physics University of New Hampshire
More informationHPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming
KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Introduction to CUDA programming 1 Agenda GPU Architecture Overview Tools of the Trade Introduction to CUDA C Patterns of Parallel
More informationOpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016
OpenACC. Part I Ned Nedialkov McMaster University Canada October 2016 Outline Introduction Execution model Memory model Compiling pgaccelinfo Example Speedups Profiling c 2016 Ned Nedialkov 2/23 Why accelerators
More informationParallel Programming. Libraries and implementations
Parallel Programming Libraries and implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationPortable and Productive Performance with OpenACC Compilers and Tools. Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.
Portable and Productive Performance with OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 Cray: Leadership in Computational Research Earth Sciences
More informationModule 10: Open Multi-Processing Lecture 19: What is Parallelization? The Lecture Contains: What is Parallelization? Perfectly Load-Balanced Program
The Lecture Contains: What is Parallelization? Perfectly Load-Balanced Program Amdahl's Law About Data What is Data Race? Overview to OpenMP Components of OpenMP OpenMP Programming Model OpenMP Directives
More informationLecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators
Lecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators CSCE 569 Parallel Computing Department of Computer Science and Engineering Yonghong Yan yanyh@cse.sc.edu
More informationCS 470 Spring Mike Lam, Professor. Advanced OpenMP
CS 470 Spring 2018 Mike Lam, Professor Advanced OpenMP Atomics OpenMP provides access to highly-efficient hardware synchronization mechanisms Use the atomic pragma to annotate a single statement Statement
More informationAdrian Tate XK6 / openacc workshop Manno, Mar
Adrian Tate XK6 / openacc workshop Manno, Mar6-7 2012 1 Overview & Philosophy Two modes of usage Contents Present contents Upcoming releases Optimization of libsci_acc Autotuning Adaptation Asynchronous
More informationOverview. Lecture 6: odds and ends. Synchronicity. Warnings. synchronicity. multiple streams and devices. multiple GPUs. other odds and ends
Overview Lecture 6: odds and ends Prof. Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre synchronicity multiple streams and devices multiple GPUs other
More informationProgramming Environment 4/11/2015
Programming Environment 4/11/2015 1 Vision Cray systems are designed to be High Productivity as well as High Performance Computers The Cray Programming Environment (PE) provides a simple consistent interface
More informationLecture 6: odds and ends
Lecture 6: odds and ends Prof. Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Lecture 6 p. 1 Overview synchronicity multiple streams and devices
More informationPortability of OpenMP Offload Directives Jeff Larkin, OpenMP Booth Talk SC17
Portability of OpenMP Offload Directives Jeff Larkin, OpenMP Booth Talk SC17 11/27/2017 Background Many developers choose OpenMP in hopes of having a single source code that runs effectively anywhere (performance
More informationEXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY. Stephen Abbott, March
EXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY Stephen Abbott, March 26 2018 ACKNOWLEDGEMENTS Collaborators: Oak Ridge Nation Laboratory- Ed D Azevedo NVIDIA - Peng
More informationOpenACC. Arthur Lei, Michelle Munteanu, Michael Papadopoulos, Philip Smith
OpenACC Arthur Lei, Michelle Munteanu, Michael Papadopoulos, Philip Smith 1 Introduction For this introduction, we are assuming you are familiar with libraries that use a pragma directive based structure,
More informationAdvanced CUDA Optimizations. Umar Arshad ArrayFire
Advanced CUDA Optimizations Umar Arshad (@arshad_umar) ArrayFire (@arrayfire) ArrayFire World s leading GPU experts In the industry since 2007 NVIDIA Partner Deep experience working with thousands of customers
More informationGPU programming: CUDA basics. Sylvain Collange Inria Rennes Bretagne Atlantique
GPU programming: CUDA basics Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr This lecture: CUDA programming We have seen some GPU architecture Now how to program it? 2 Outline
More informationOPENACC ONLINE COURSE 2018
OPENACC ONLINE COURSE 2018 Week 3 Loop Optimizations with OpenACC Jeff Larkin, Senior DevTech Software Engineer, NVIDIA ABOUT THIS COURSE 3 Part Introduction to OpenACC Week 1 Introduction to OpenACC Week
More informationIs OpenMP 4.5 Target Off-load Ready for Real Life? A Case Study of Three Benchmark Kernels
National Aeronautics and Space Administration Is OpenMP 4.5 Target Off-load Ready for Real Life? A Case Study of Three Benchmark Kernels Jose M. Monsalve Diaz (UDEL), Gabriele Jost (NASA), Sunita Chandrasekaran
More informationPragma-based GPU Programming and HMPP Workbench. Scott Grauer-Gray
Pragma-based GPU Programming and HMPP Workbench Scott Grauer-Gray Pragma-based GPU programming Write programs for GPU processing without (directly) using CUDA/OpenCL Place pragmas to drive processing on
More informationEvaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices
Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Jonas Hahnfeld 1, Christian Terboven 1, James Price 2, Hans Joachim Pflug 1, Matthias S. Müller
More informationADVANCED OPENACC PROGRAMMING
ADVANCED OPENACC PROGRAMMING DR. CHRISTOPH ANGERER, NVIDIA *) THANKS TO JEFF LARKIN, NVIDIA, FOR THE SLIDES AGENDA Optimizing OpenACC Loops Routines Update Directive Asynchronous Programming Multi-GPU
More informationMULTI GPU PROGRAMMING WITH MPI AND OPENACC JIRI KRAUS, NVIDIA
MULTI GPU PROGRAMMING WITH MPI AND OPENACC JIRI KRAUS, NVIDIA MPI+OPENACC GDDR5 Memory System Memory GDDR5 Memory System Memory GDDR5 Memory System Memory GPU CPU GPU CPU GPU CPU PCI-e PCI-e PCI-e Network
More informationOmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel
www.bsc.es OmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel Guray Ozen guray.ozen@bsc.es Exascale in BSC Marenostrum 4 (13.7 Petaflops ) General purpose cluster (3400
More informationOpenCL TM & OpenMP Offload on Sitara TM AM57x Processors
OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors 1 Agenda OpenCL Overview of Platform, Execution and Memory models Mapping these models to AM57x Overview of OpenMP Offload Model Compare and contrast
More informationAccelerator programming with OpenACC
..... Accelerator programming with OpenACC Colaboratorio Nacional de Computación Avanzada Jorge Castro jcastro@cenat.ac.cr 2018. Agenda 1 Introduction 2 OpenACC life cycle 3 Hands on session Profiling
More informationNever forget Always use the ftn, cc, and CC wrappers
Using Compilers 2 Never forget Always use the ftn, cc, and CC wrappers The wrappers uses your module environment to get all libraries and include directories for you. You don t have to know their real
More informationDATA-MANAGEMENT DIRECTORY FOR OPENMP 4.0 AND OPENACC
DATA-MANAGEMENT DIRECTORY FOR OPENMP 4.0 AND OPENACC Heteropar 2013 Julien Jaeger, Patrick Carribault, Marc Pérache CEA, DAM, DIF F-91297 ARPAJON, FRANCE 26 AUGUST 2013 24 AOÛT 2013 CEA 26 AUGUST 2013
More informationEE/CSCI 451 Introduction to Parallel and Distributed Computation. Discussion #4 2/3/2017 University of Southern California
EE/CSCI 451 Introduction to Parallel and Distributed Computation Discussion #4 2/3/2017 University of Southern California 1 USC HPCC Access Compile Submit job OpenMP Today s topic What is OpenMP OpenMP
More informationAutomatic Testing of OpenACC Applications
Automatic Testing of OpenACC Applications Khalid Ahmad School of Computing/University of Utah Michael Wolfe NVIDIA/PGI November 13 th, 2017 Why Test? When optimizing or porting Validate the optimization
More informationGPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh
GPU Programming EPCC The University of Edinburgh Contents NVIDIA CUDA C Proprietary interface to NVIDIA architecture CUDA Fortran Provided by PGI OpenCL Cross platform API 2 NVIDIA CUDA CUDA allows NVIDIA
More informationINTRODUCTION TO OPENACC
INTRODUCTION TO OPENACC Hossein Pourreza hossein.pourreza@umanitoba.ca March 31, 2016 Acknowledgement: Most of examples and pictures are from PSC (https://www.psc.edu/images/xsedetraining/openacc_may2015/
More informationECE 574 Cluster Computing Lecture 10
ECE 574 Cluster Computing Lecture 10 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 1 October 2015 Announcements Homework #4 will be posted eventually 1 HW#4 Notes How granular
More informationPortable and Productive Performance on Hybrid Systems with OpenACC Compilers and Tools
Portable and Productive Performance on Hybrid Systems with OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. Major Hybrid Multi Petaflop Systems
More informationGPU Computing with OpenACC Directives Dr. Timo Stich Developer Technology Group NVIDIA Corporation
GPU Computing with OpenACC Directives Dr. Timo Stich Developer Technology Group NVIDIA Corporation WHAT IS GPU COMPUTING? Add GPUs: Accelerate Science Applications CPU GPU Small Changes, Big Speed-up Application
More informationCompiler Optimizations. Aniello Esposito HPC Saudi, March 15 th 2016
Compiler Optimizations Aniello Esposito HPC Saudi, March 15 th 2016 Using Compiler Feedback Compilers can generate annotated listing of your source code indicating important optimizations. Useful for targeted
More information