Optimization and porting of a numerical code for simulations in GRMHD on CPU/GPU clusters PRACE Winter School Stage
1 Optimization and porting of a numerical code for simulations in GRMHD on CPU/GPU clusters PRACE Winter School Stage INFN - Università di Parma November 6, 2012
2 Table of contents 1 Introduction 2 3 4
3 Let me introduce myself I am a 2nd-year PhD student in Physics at Parma University Master Thesis in Florence with L. Del Zanna: oscillations of rotating Neutron Stars (ECHO code) PhD in Parma with R. De Pietri: dynamical instabilities of rotating NS with Magnetic Fields in GR (Cactus/Whisky) PRACE Winter School 2012 here in CINECA I would like to have a fast working code to perform 2D simulations of NS on a cluster with GPUs The starting point is the 2- and 3-D code ECHO The ECHO code needs to be optimized, fully parallelized and ported to GPUs
4 General Relativistic Magnetohydrodynamics (GRMHD) GRMHD is the study of magnetized fluid flows in general relativistic spacetimes It is required for the study of extremely compact astrophysical objects, like neutron stars and black holes The spacetime is a solution of Einstein's field equations of general relativity, and can be either solved for dynamically or assumed given by the initial data (Cowling approximation) Solution techniques are similar to traditional Computational Fluid Dynamics, with some important differences
5 : Overview Eulerian Conservative High Order code: the aim is to combine shock-capturing properties and accuracy for small scale wave propagation and turbulence, in a 3+1 approach (L. Del Zanna, O. Zanotti, N. Bucciantini, P. Londrillo, 2007, A&A 473, 11) GR upgrade of: Londrillo & Del Zanna 2000; Del Zanna et al. 2002, 2003 Modular structure, F90 language, MPI parallelization Any metric allowed (1-, 2- or 3-D), even time-dependent Finite-difference scheme, Runge-Kutta time-stepping UCT strategy for the magnetic field (staggered grid) Central-type Riemann solvers (LLF, HLL, HLLC) Upgrades: resistivity, radiation HD, evolving spacetimes
6 : numerical scheme The two sets of conservation laws are discretized in space according to the Upwind Constrained Transport strategy (UCT: Londrillo & Del Zanna ApJ 530, 508, 2000; JCP 195, 17, 2004) Staggered grid for magnetic and electric field components Finite differences: point values at cell centers (u), at cell faces (b and f), at edges (e).
7 : Upwind Constrained Transport The divergence-free constraint is satisfied algebraically at 2nd spatial order Single upwind state for B_i along direction i Four-state numerical fluxes for the magnetic field at edges
8 : the evolution scheme (1/3) The ECHO evolution scheme needs to: Compute the primitive fluid variables from the conservative ones (see next) and interpolate the B components at cell center 8 primitive variables at cell center: P = [ρ, v, p, B]^T For every direction, fill the boundary ghost zones with the values of the primitives and reconstruct left (L) and right (R) upwind states at interfaces (B_i along i is unchanged): P^{L,R}_{i+1/2,j} = R^{L,R}({P_{ij}}) Compute metric properties at interfaces interpolating the metrics defined on the grid (only at t = 0 in Cowling approx.)
9 : the evolution scheme (2/3) Compute the upwind fluxes for the fluid part from the primitive variables using an approximate Riemann solver (save the fast-magnetosonic and transverse transport speeds) For every direction, compute the spatial derivatives of the fluxes at cell center and the maximum wave speed Compute geometrical source terms (right-hand side of the conservation laws) Reconstruct B and v from cell faces to cell edges Compute numerical fluxes for the magnetic field (upwind electric fields) and their derivatives
10 : the evolution scheme (3/3) Compute the maximum time step obeying the CFL (Courant-Friedrichs-Lewy) condition with 0 < c < 1: Δt = c / max_i (a^M_i / h_i) (the a^M_i are the maximum speeds over the whole domain, for each direction i) Update all the variables using the flux and electric field derivatives with a 2nd-order Runge-Kutta scheme: 1. u^{(1)} = u^n + Δt R(u^n) 2. u^{n+1} = (1/2) u^n + (1/2) [u^{(1)} + Δt R(u^{(1)})]
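The time-step and update formulas above can be sketched in plain C. This is an illustrative stand-in, not ECHO code: `rhs` is a trivial decay term rather than the actual GRMHD right-hand side, and both function names are made up.

```c
#include <stddef.h>

/* Stand-in right-hand side R(u); the real ECHO RHS assembles flux
   derivatives and geometrical source terms. Here simply R(u) = -u. */
static double rhs(double u) { return -u; }

/* One step of the 2nd-order Runge-Kutta scheme from the slide:
   u1      = u_n + dt*R(u_n)
   u_{n+1} = 0.5*u_n + 0.5*(u1 + dt*R(u1))                          */
double rk2_step(double u, double dt)
{
    double u1 = u + dt * rhs(u);
    return 0.5 * u + 0.5 * (u1 + dt * rhs(u1));
}

/* CFL time step dt = c / max_i(a_i / h_i), with 0 < c < 1:
   a[i] are per-direction maximum wave speeds, h[i] grid spacings.   */
double cfl_dt(const double *a, const double *h, size_t ndim, double c)
{
    double m = 0.0;
    for (size_t i = 0; i < ndim; ++i)
        if (a[i] / h[i] > m)
            m = a[i] / h[i];
    return c / m;
}
```

For a linear decay RHS the scheme reproduces the Taylor expansion 1 - dt + dt²/2 after one step, a quick sanity check of 2nd-order accuracy.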
11 ECHO: The time evolution routine
12 Optimizations The main optimizations on the original code are the following: implicit none was added to all subroutines so that all variables are now explicitly declared all the variables read and written by a subroutine are now passed as arguments and declared with their own intent many do loops were conveniently merged or re-arranged some routines have been split into different routines for different cases many useless temporary array variables were eliminated all the physical parameters, the evolution parameters and the grid and boundary settings are now contained in one file instead of being scattered among many source files
13 CPUs vs GPUs CPU design: multi-core sophisticated control logic unit large cache memories to reduce access latencies GPU design: many cores (several hundreds) minimized control logic in order to manage lightweight threads and maximize execution throughput large number of threads to overcome long-latency memory accesses
14 OpenACC: What is it? is a directive-based API standard for programming accelerators is portable across operating systems and various types of host CPUs and accelerators allows programmers to provide directives to the compiler identifying which areas of code to accelerate, without requiring programmers to modify or adapt the code itself is aimed at incremental development of accelerator code
15 OpenACC: About the standard announced at the SC11 conference (Seattle, November 2011) offers portability between compilers drawn up by: NVIDIA, Cray, PGI, CAPS works for Fortran, C, C++ standard available at current version: 1.0a (April 2012) work is now targeting additional features for v1.1 compiler support: all complete in 2012 (PGI full support from version 12.6, late August)
16 OpenACC: Execution model Host-directed execution with attached GPU accelerator Main program executes on host (i.e. CPU) Compute-intensive regions are offloaded to the accelerator device under control of the host The device (i.e. GPU) executes parallel regions, which typically contain kernels (i.e. work-sharing loops), or kernels regions, containing loops executed as kernels The host must orchestrate the execution by: allocating memory on the accelerator device, initiating data transfer, sending the code to the accelerator, passing arguments to the parallel region, queuing the device code, waiting for completion, transferring results back to the host, and deallocating memory.
17 OpenACC: Memory model Memory spaces on the host and device are distinct Different locations, different address space Data movement performed by the host using runtime library calls that explicitly move data GPUs have a weak memory model No synchronisation between different execution units (SMs) (unless explicit memory barrier) Can write OpenACC kernels with race conditions, giving inconsistent execution results Compiler will catch most errors, but not all OpenACC data movement between the memories is implicit, managed by the compiler based on directives from the programmer Device memory caches are managed by the compiler with hints from the programmer in the form of directives.
18 OpenACC: Benefits Modify original source code with directives Non-executable statements (comments, pragmas) Can be ignored by a non-accelerating compiler Don't need a separate source base, much more portable Can preserve subprogram structure Familiar programming model for those who have used traditional OpenMP A small performance gap is acceptable (target is 10-15%, currently seeing better than that for many cases) An open standard is the most attractive for developers Portability, multiple compilers for debugging
19 OpenACC: Levels of parallelism The model target architecture is a collection of processing elements or PEs, where each PE is multithreaded, and each thread on the PE can execute vector instructions. The OpenACC execution model has three levels of parallelism: the gang dimension would map across the PEs (CUDA blocks) the worker dimension across the multithreading dimension within a PE (warps) the vector dimension to the vector instructions (threads within a warp) There is no support for any synchronization between gangs, since current accelerators typically do not support synchronization across PEs.
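A minimal C sketch of how these three levels are typically spelled out on a loop nest: the outer loop is distributed across gangs (CUDA blocks), the inner one across vector lanes (threads in a warp). The routine is a made-up example; with a non-OpenACC compiler the pragmas are ignored and the loops simply run sequentially on the host.

```c
/* Scale every row of an n-by-m matrix by s, mapping the outer loop
   to gangs and the inner loop to vector lanes.                      */
void scale_rows(int n, int m, float a[n][m], float s)
{
    #pragma acc parallel loop gang
    for (int i = 0; i < n; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < m; ++j)
            a[i][j] *= s;
    }
}
```

Because gangs cannot synchronize with each other, each outer iteration must be independent, which is the case here: each row is touched by exactly one gang.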
20 OpenACC: Categories of OpenACC APIs Accelerator Parallel Region / Kernels Directives Loop Directives Data Declaration Directives Data Regions Directives Wait / update directives Runtime Library Routines Environment variables
21 OpenACC: Accelerator directives Fortran: sentinel: !$acc * paired with !$acc end * continuation lines: & + !$acc& C/C++: sentinel: #pragma acc * applies to the following structured block {... }, which avoids the need for end directives continuation lines: \ (at line end) General form: !$acc directive-name [clause [,clause]...]
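For example, in C the directive governs the structured block that follows it, so no end directive is needed. This is a minimal sketch with a made-up `saxpy` routine; for a compiler without OpenACC support the pragma is a no-op and the loop runs on the host.

```c
/* y = a*x + y over n elements; the kernels directive asks the
   compiler to turn the loop into an accelerator kernel.             */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma acc kernels
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }
}
```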
22 OpenACC: Parallel Directive 1/2 Starts parallel execution on the accelerator Specified by: !$acc parallel [clause [,clause]...] When encountered: Gangs of worker threads are created to execute on the accelerator One worker in each gang begins executing the code in the structured block The number of gangs/workers remains constant in the parallel region
23 OpenACC: Parallel Directive 2/2 The clauses for the !$acc parallel directive are: if (condition) async [(scalar-integer-expression)] num_gangs, num_workers, vector_length (scalar-integer-expr.) reduction (operator:list) copy, copyin, copyout (list) create (list) private (list) present, present_or_... (list) If async is not present, there is an implicit barrier at the end of the accelerator parallel region present_or_copy is the default for aggregate types (arrays) private or copy is the default for scalar variables
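A few of these clauses in context: an illustrative C sketch (the `dot` routine is an assumed example, not from ECHO) combining a `reduction` on the accumulator with `copyin` data clauses. Serially, without OpenACC, it is an ordinary dot product.

```c
/* Dot product with a sum reduction; x and y are copied to the
   device read-only, sum is reduced across all gangs/workers.        */
double dot(int n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma acc parallel loop reduction(+:sum) copyin(x[0:n], y[0:n])
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```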
24 OpenACC: Kernels Directive Defines a region of a program that is to be compiled into a sequence of kernels for execution on the accelerator Each loop nest will be a different kernel Kernels are launched in order on the device Specified by: !$acc kernels [clause [,clause]...] A kernels directive may not contain a nested parallel or kernels directive The configuration of gangs and worker threads may be different for each kernel If async is present, the kernels or parallel region will execute asynchronously on the accelerator present_or_copy is the default for aggregate types (arrays) private or copy is the default for scalar variables
25 OpenACC: Loop Directive Used to describe what type of parallelism to use to execute the loop on the accelerator Can be used to declare loop-private variables, arrays and reduction operations Specified by: !$acc loop [clause [,clause]...] followed by a do loop The clauses for the !$acc loop directive are: collapse (n) gang, worker, vector [(scalar-integer-expression)] seq independent private (list) reduction (operator:list) Combined directives are specified by: !$acc parallel loop, !$acc kernels loop
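An illustrative C use of the `collapse` and `reduction` clauses on a rectangular loop nest (`max_abs` is a made-up routine): `collapse(2)` fuses the two loops into one parallel iteration space, and the `max` reduction accumulates a global maximum.

```c
/* Maximum absolute value of an n-by-m row-major array.              */
double max_abs(int n, int m, const double *a)
{
    double amax = 0.0;
    #pragma acc parallel loop collapse(2) reduction(max:amax)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j) {
            double v = a[i * m + j];
            if (v < 0.0) v = -v;
            if (v > amax) amax = v;
        }
    return amax;
}
```

This is the same pattern ECHO needs to compute the maximum wave speed over the whole domain for the CFL condition.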
26 OpenACC: Data Directive The data construct defines scalars, arrays and subarrays to be allocated in the accelerator memory for the duration of the region Can be used to control whether data should be copied in or out from the host Specified by: !$acc data [clause [,clause]...] The clauses for the !$acc data directive are: if (condition) copy, copyin, copyout (list) create (list) present, present_or_... (list) deviceptr (list)
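A sketch of a data region spanning two kernels, so the arrays stay on the device between them instead of being transferred twice (a hypothetical C example): `x` is copied in once, `y` is copied out once at the end of the region.

```c
/* Two kernels sharing one data region: y = x*x, then y += c.
   Without the data directive each kernel would move x and y
   across the PCIe bus separately.                                   */
void square_then_shift(int n, const double *x, double *y, double c)
{
    #pragma acc data copyin(x[0:n]) copyout(y[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] = x[i] * x[i];

        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] += c;
    }
}
```

This is exactly the pattern the later ECHO-GPU slides rely on: keep the evolved variables resident on the device across many kernels.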
27 OpenACC: Declare Directive Used in the variable declaration section of a program to specify that a variable should be allocated and copied in/out in an implicit data region of a function, subroutine or program If specified within a Fortran module, the implicit data region is valid for the whole program Specified by: !$acc declare [clause [,clause]...] Not fully implemented in PGI compiler release 12.8 yet? see later...
28 OpenACC: Update and Wait Directives OpenACC Update Directive: !$acc update [clause [,clause]...] Used within a data region to update / synchronize the values of the arrays on the host or the accelerator The clauses for the !$acc update directive are: host, device (list) if (condition) async [(scalar-integer-expression)] OpenACC Wait Directive: !$acc wait [(scalar-integer-expression)] It causes the program to wait for completion of an asynchronous activity such as an accelerator parallel or kernels region or an update directive It will test and evaluate the integer expression for completion If no argument is specified, the host process will wait until all asynchronous activities have completed
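Both directives in one small hedged C sketch (`double_in_place` is invented for illustration): the host modifies the array inside a data region, pushes the new values to the device with `update device`, launches the kernel asynchronously on queue 1, and waits before the region ends. Without an accelerator the pragmas have no effect and the code runs serially.

```c
/* Add 1 on the host, synchronize to the device, then double on the
   device asynchronously and wait for completion.                    */
void double_in_place(int n, double *x)
{
    #pragma acc data copy(x[0:n])
    {
        for (int i = 0; i < n; ++i)   /* host-side modification      */
            x[i] += 1.0;

        #pragma acc update device(x[0:n])

        #pragma acc parallel loop async(1)
        for (int i = 0; i < n; ++i)
            x[i] *= 2.0;

        #pragma acc wait(1)
    }
}
```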
29 OpenACC: Directive Status The PGI compiler has been providing full OpenACC support since release 12.6 (August 2012) Some of the features supposed to be implemented in 12.6 are: !$acc parallel reduction and !$acc loop reduction !$acc parallel private() and !$acc loop private() !$acc parallel present() and !$acc parallel present_or_...() !$acc parallel deviceptr() !$acc declare device_resident() and !$acc declare deviceptr() Actually, from the pgroup user forum: We're still missing a few OpenACC features and device_resident is one of them. We expect it to be in by the release
30 OpenACC: Runtime Routines acc_get_num_devices() acc_set_device_type() acc_get_device_type() acc_set_device_num() acc_get_device_num() acc_async_wait() acc_async_wait_all() acc_init() acc_shutdown() acc_on_device() acc_malloc() acc_free()
31 OpenACC: Environment Variables ACC_DEVICE_TYPE and ACC_DEVICE_NUM ACC_NOTIFY shows the list of launched kernels with detailed information about: the number of the device that executes the kernel the name of the function the kernel is launched from, the file that contains it and the line in the file the grid and block dimensions Example: export ACC_NOTIFY=1 file=.../testgpu.f90 function=testgpu line=18 device=0 grid=1x200 block=128 queue=0 PGI_ACC_TIME is equivalent to the flag -ta=nvidia,time Example: export PGI_ACC_TIME=1 18: region entered 1000 times time(us): total=536,077 init=177 region=535,900 kernels=11,627 data=286,750 w/o init: total=535,900 max=716 min=519 avg=535
32 OpenACC: Mandatory requirements Privatize arrays (scalars are private by default) Error: Parallelization would require privatization of array a(:) All loops must be rectangular Restructure linearized arrays with computed indices Error: Non-stride-1 accesses for array b Privatize live-out scalars Accelerator restriction: induction variable live-out from loop No function calls in directive regions (manually or automatically inline subroutines) Error: Accelerator region ignored Accelerator restriction: function/procedure calls are not supported Avoid print or write operations
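The first requirement can be illustrated with a small C sketch: the scratch array `tmp` is listed in a `private` clause so each iteration gets its own copy, which avoids the "would require privatization of array" error (the `smooth3` stencil routine is made up for this example).

```c
/* 3-point moving average over the interior points; tmp is a
   per-iteration scratch array and must be privatized.               */
void smooth3(int n, const double *in, double *out)
{
    double tmp[3];
    #pragma acc parallel loop private(tmp)
    for (int i = 1; i < n - 1; ++i) {
        tmp[0] = in[i - 1];
        tmp[1] = in[i];
        tmp[2] = in[i + 1];
        out[i] = (tmp[0] + tmp[1] + tmp[2]) / 3.0;
    }
}
```

Without the clause, all iterations would share one `tmp` and race on it; the compiler refuses to parallelize rather than produce wrong answers.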
33 OpenACC: Tips All parallel regions should contain a loop directive Always start with parallel loop (not just parallel) Always use : when shaping with an entire dimension (i.e. A(:,1:2)) Watch for runtime device errors, for example: Call to cumemcpydtoh returned error: Launch failed Call to cumemcpy2d returned error: Invalid value First get your code working without data regions, then add data regions: be aware of data movement leave data on the GPU across procedure boundaries First get your code working without async, then add async Use directive clauses to optimize performance
34 OpenACC: Compilation flags The minimum set of flags needed to use OpenACC directives is: -acc (enables OpenACC) -ta=nvidia (target) -ta=nvidia:cc20 (GPU capability, cc20 stands for compute capability 2.0) -ta=nvidia,time (enables Accelerator Kernel Timing data) -ta=nvidia,host (generates two versions of routines, one that runs on the host and one on the GPU) -Minline -Mipa=fast,inline,reshape (enables IPA, automatic inlining and array reshaping) -O2 at least (if less, -Mipa forces -O2 anyway) -Minfo=inline,accel (enables compiler feedback) Example: pgfortran -fast -O3 -acc -ta=host,nvidia:cc20,time -Minline -Mipa=fast,inline,reshape -Minfo=accel
35 OpenACC: Simple examples See live examples on PLX...
36 OpenACC: Performance Example: Jacobi relaxation calculation on a 4096 x 4096 mesh directory with source code on PLX: /plx/userexternal/lfranci0/stage/openacc/ wiki page with all details and results
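One interior sweep of such a Jacobi relaxation might look like this in C, a miniature sketch of the standard benchmark rather than the actual PLX source: each point is replaced by the average of its four neighbours, and a `max` reduction tracks the largest change for the convergence test.

```c
/* One Jacobi sweep on an n-by-m row-major grid (5-point stencil);
   boundary values are held fixed. Returns the maximum update.       */
double jacobi_sweep(int n, int m, const double *restrict a,
                    double *restrict anew)
{
    double err = 0.0;
    #pragma acc parallel loop collapse(2) reduction(max:err)
    for (int i = 1; i < n - 1; ++i)
        for (int j = 1; j < m - 1; ++j) {
            anew[i * m + j] = 0.25 * (a[(i + 1) * m + j] + a[(i - 1) * m + j]
                                    + a[i * m + j + 1] + a[i * m + j - 1]);
            double d = anew[i * m + j] - a[i * m + j];
            if (d < 0.0) d = -d;
            if (d > err) err = d;
        }
    return err;
}
```

In the benchmark, this sweep sits inside an iteration loop wrapped in an `!$acc data` / `#pragma acc data` region, so the two grids stay on the device until convergence.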
37 ECHO-GPU: Necessary rearrangements Many changes and rearrangements were necessary to allow all subroutines in the evolution routine to be run as GPU kernels: it was mandatory to use at least the -O2 optimization flag, interprocedural analysis and automatic inlining and reshaping some small subroutines, or subroutines used only once, were manually inlined to better manage the variables many do loops were re-arranged for synchronization purposes all the print statements inside parallel regions were removed a common statement was removed many exit statements were removed temporary arrays were substituted with scalars where possible many small do loops were manually unrolled many temporary arrays were privatized
38 ECHO-GPU: partial OpenACC implementation
39 ECHO-GPU: full OpenACC implementation only the primitive and conservative variables (together with the metric terms and the grid variables) are copied to the device at the beginning of the evolution and copied back to the host at the end of the simulation all the other variables are created directly on the GPU with !$acc create or !$acc declare device_resident, and then declared present in all the subroutines called inside evolve
40 ECHO-GPU: Performance Test run: 2D mesh, 120x50 points, tmax=0.005 ms Execution time CPU-version: sec GPU-version: sec (already achieved) GPU-version: 40 sec (theoretical but plausible value with a small further effort) Speedup achieved: 1.5x theoretical: 2.3x Real runs have much longer evolution times and finer grids, and in such cases the performance is expected to be better
41 ECHO-GPU: Short-term improvements Further improvements can be quite easily achieved by: extending the data region outside the main do while loop creating all the temporary variables directly in the device memory and copying only the primitive and conservative variable arrays together with the metric terms and grids avoiding the copy of some variables defined in inlined subfunctions using the new OpenACC features implemented in the PGI compiler release tuning the parallelization with the right choice of the numbers of threads and blocks
42 ECHO-GPU: Long-term improvements Further medium- and long-term possible improvements include: implementing an OpenMP parallelization, taking advantage of the work already done to use OpenACC directives using MPI to manage multiple GPUs moving from cylindrical coordinates to Cartesian coordinates implementing a Python user interface implementing parallel HDF5 I/O
43 Thank you for your attention Comments and suggestions are welcome and enjoy your acceleration! mail:
More informationProductive Performance on the Cray XK System Using OpenACC Compilers and Tools
Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 The New Generation of Supercomputers Hybrid
More informationFrom Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation
From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation Erik Schnetter, Perimeter Institute with M. Blazewicz, I. Hinder, D. Koppelman, S. Brandt, M. Ciznicki, M.
More informationOpenACC Fundamentals. Steve Abbott November 15, 2017
OpenACC Fundamentals Steve Abbott , November 15, 2017 AGENDA Data Regions Deep Copy 2 while ( err > tol && iter < iter_max ) { err=0.0; JACOBI ITERATION #pragma acc parallel loop reduction(max:err)
More informationExperiences with CUDA & OpenACC from porting ACME to GPUs
Experiences with CUDA & OpenACC from porting ACME to GPUs Matthew Norman Irina Demeshko Jeffrey Larkin Aaron Vose Mark Taylor ORNL is managed by UT-Battelle for the US Department of Energy ORNL Sandia
More informationIntroduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines
Introduction to OpenMP Introduction OpenMP basics OpenMP directives, clauses, and library routines What is OpenMP? What does OpenMP stands for? What does OpenMP stands for? Open specifications for Multi
More informationProfiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015
Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15: Profiling and Parallelizing with the OpenACC Toolkit
More informationINTRODUCTION TO OPENACC
INTRODUCTION TO OPENACC Hossein Pourreza hossein.pourreza@umanitoba.ca March 31, 2016 Acknowledgement: Most of examples and pictures are from PSC (https://www.psc.edu/images/xsedetraining/openacc_may2015/
More informationAdvanced OpenACC. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2017
Advanced OpenACC John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center Copyright 2017 Outline Loop Directives Data Declaration Directives Data Regions Directives Cache directives Wait
More informationINTRODUCTION TO COMPILER DIRECTIVES WITH OPENACC
INTRODUCTION TO COMPILER DIRECTIVES WITH OPENACC DR. CHRISTOPH ANGERER, NVIDIA *) THANKS TO JEFF LARKIN, NVIDIA, FOR THE SLIDES 3 APPROACHES TO GPU PROGRAMMING Applications Libraries Compiler Directives
More informationOpenACC Course Lecture 1: Introduction to OpenACC September 2015
OpenACC Course Lecture 1: Introduction to OpenACC September 2015 Course Objective: Enable you to accelerate your applications with OpenACC. 2 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15:
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationBarbara Chapman, Gabriele Jost, Ruud van der Pas
Using OpenMP Portable Shared Memory Parallel Programming Barbara Chapman, Gabriele Jost, Ruud van der Pas The MIT Press Cambridge, Massachusetts London, England c 2008 Massachusetts Institute of Technology
More informationGPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA
GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit
More informationINTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017
INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and
More informationPGPROF OpenACC Tutorial
PGPROF OpenACC Tutorial Version 2017 PGI Compilers and Tools TABLE OF CONTENTS Chapter 1. Tutorial Setup...1 Chapter 2. Profiling the application... 2 Chapter 3. Adding OpenACC directives... 4 Chapter
More informationOptimizing OpenACC Codes. Peter Messmer, NVIDIA
Optimizing OpenACC Codes Peter Messmer, NVIDIA Outline OpenACC in a nutshell Tune an example application Data motion optimization Asynchronous execution Loop scheduling optimizations Interface OpenACC
More informationA Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA
A Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA L. Oteski, G. Colin de Verdière, S. Contassot-Vivier, S. Vialle, J. Ryan Acks.: CEA/DIFF, IDRIS, GENCI, NVIDIA, Région
More informationOpenACC introduction (part 2)
OpenACC introduction (part 2) Aleksei Ivakhnenko APC Contents Understanding PGI compiler output Compiler flags and environment variables Compiler limitations in dependencies tracking Organizing data persistence
More informationAdvanced OpenACC. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2018
Advanced OpenACC John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center Copyright 2018 Outline Loop Directives Data Declaration Directives Data Regions Directives Cache directives Wait
More informationTowards an Efficient CPU-GPU Code Hybridization: a Simple Guideline for Code Optimizations on Modern Architecture with OpenACC and CUDA
Towards an Efficient CPU-GPU Code Hybridization: a Simple Guideline for Code Optimizations on Modern Architecture with OpenACC and CUDA L. Oteski, G. Colin de Verdière, S. Contassot-Vivier, S. Vialle,
More informationHigh-order, conservative, finite difference schemes for computational MHD
High-order, conservative, finite difference schemes for computational MHD A. Mignone 1, P. Tzeferacos 1 and G. Bodo 2 [1] Dipartimento di Fisica Generale, Turin University, ITALY [2] INAF Astronomic Observatory
More informationAFOSR BRI: Codifying and Applying a Methodology for Manual Co-Design and Developing an Accelerated CFD Library
AFOSR BRI: Codifying and Applying a Methodology for Manual Co-Design and Developing an Accelerated CFD Library Synergy@VT Collaborators: Paul Sathre, Sriram Chivukula, Kaixi Hou, Tom Scogland, Harold Trease,
More informationA Simulation of Global Atmosphere Model NICAM on TSUBAME 2.5 Using OpenACC
A Simulation of Global Atmosphere Model NICAM on TSUBAME 2.5 Using OpenACC Hisashi YASHIRO RIKEN Advanced Institute of Computational Science Kobe, Japan My topic The study for Cloud computing My topic
More informationSENSEI / SENSEI-Lite / SENEI-LDC Updates
SENSEI / SENSEI-Lite / SENEI-LDC Updates Chris Roy and Brent Pickering Aerospace and Ocean Engineering Dept. Virginia Tech July 23, 2014 Collaborations with Math Collaboration on the implicit SENSEI-LDC
More information6.1 Multiprocessor Computing Environment
6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,
More informationAsynchronous OpenCL/MPI numerical simulations of conservation laws
Asynchronous OpenCL/MPI numerical simulations of conservation laws Philippe HELLUY 1,3, Thomas STRUB 2. 1 IRMA, Université de Strasbourg, 2 AxesSim, 3 Inria Tonus, France IWOCL 2015, Stanford Conservation
More informationAddressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer
Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems Ed Hinkel Senior Sales Engineer Agenda Overview - Rogue Wave & TotalView GPU Debugging with TotalView Nvdia CUDA Intel Phi 2
More informationOpenACC Accelerator Directives. May 3, 2013
OpenACC Accelerator Directives May 3, 2013 OpenACC is... An API Inspired by OpenMP Implemented by Cray, PGI, CAPS Includes functions to query device(s) Evolving Plan to integrate into OpenMP Support of
More informationAdvanced OpenMP. Lecture 11: OpenMP 4.0
Advanced OpenMP Lecture 11: OpenMP 4.0 OpenMP 4.0 Version 4.0 was released in July 2013 Starting to make an appearance in production compilers What s new in 4.0 User defined reductions Construct cancellation
More informationAcceleration of a Python-based Tsunami Modelling Application via CUDA and OpenHMPP
Acceleration of a Python-based Tsunami Modelling Application via CUDA and OpenHMPP Zhe Weng and Peter Strazdins*, Computer Systems Group, Research School of Computer Science, The Australian National University
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class
More informationAn Introduction to OpenACC - Part 1
An Introduction to OpenACC - Part 1 Feng Chen HPC User Services LSU HPC & LONI sys-help@loni.org LONI Parallel Programming Workshop Louisiana State University Baton Rouge June 01-03, 2015 Outline of today
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationOpenACC (Open Accelerators - Introduced in 2012)
OpenACC (Open Accelerators - Introduced in 2012) Open, portable standard for parallel computing (Cray, CAPS, Nvidia and PGI); introduced in 2012; GNU has an incomplete implementation. Uses directives in
More informationIntroduction to OpenACC
Introduction to OpenACC Alexander B. Pacheco User Services Consultant LSU HPC & LONI sys-help@loni.org LONI Parallel Programming Workshop Louisiana State University Baton Rouge June 10-12, 2013 HPC@LSU
More informationExperiences with Achieving Portability across Heterogeneous Architectures
Experiences with Achieving Portability across Heterogeneous Architectures Lukasz G. Szafaryn +, Todd Gamblin ++, Bronis R. de Supinski ++ and Kevin Skadron + + University of Virginia ++ Lawrence Livermore
More informationParallel Programming Models. Parallel Programming Models. Threads Model. Implementations 3/24/2014. Shared Memory Model (without threads)
Parallel Programming Models Parallel Programming Models Shared Memory (without threads) Threads Distributed Memory / Message Passing Data Parallel Hybrid Single Program Multiple Data (SPMD) Multiple Program
More informationOpenACC Fundamentals. Steve Abbott November 13, 2016
OpenACC Fundamentals Steve Abbott , November 13, 2016 Who Am I? 2005 B.S. Physics Beloit College 2007 M.S. Physics University of Florida 2015 Ph.D. Physics University of New Hampshire
More informationAccelerated Library Framework for Hybrid-x86
Software Development Kit for Multicore Acceleration Version 3.0 Accelerated Library Framework for Hybrid-x86 Programmer s Guide and API Reference Version 1.0 DRAFT SC33-8406-00 Software Development Kit
More informationOpenACC Support in Score-P and Vampir
Center for Information Services and High Performance Computing (ZIH) OpenACC Support in Score-P and Vampir Hands-On for the Taurus GPU Cluster February 2016 Robert Dietrich (robert.dietrich@tu-dresden.de)
More informationOpenMP - II. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS15/16. HPAC, RWTH Aachen
OpenMP - II Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS15/16 OpenMP References Using OpenMP: Portable Shared Memory Parallel Programming. The MIT
More informationAccelerating Harmonie with GPUs (or MICs)
Accelerating Harmonie with GPUs (or MICs) (A view from the starting-point) Enda O Brien, Adam Ralph Irish Centre for High-End Computing Motivation There is constant, insatiable demand for more performance
More informationGPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten
GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,
More informationOpenACC 2.5 and Beyond. Michael Wolfe PGI compiler engineer
OpenACC 2.5 and Beyond Michael Wolfe PGI compiler engineer michael.wolfe@pgroup.com OpenACC Timeline 2008 PGI Accelerator Model (targeting NVIDIA GPUs) 2011 OpenACC 1.0 (targeting NVIDIA GPUs, AMD GPUs)
More informationIntroduction to Compiler Directives with OpenACC
Introduction to Compiler Directives with OpenACC Agenda Fundamentals of Heterogeneous & GPU Computing What are Compiler Directives? Accelerating Applications with OpenACC - Identifying Available Parallelism
More information