OpenACC compiling and performance tips. May 3, 2013

Similar documents
OpenACC Accelerator Directives. May 3, 2013

Blue Waters Programming Environment

Practical: a sample code

OpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016

OpenACC and the Cray Compilation Environment James Beyer PhD

Getting Started with Directive-based Acceleration: OpenACC

OpenACC introduction (part 2)

OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4

Profiling and Parallelizing with the OpenACC Toolkit OpenACC Course: Lecture 2 October 15, 2015

PGPROF OpenACC Tutorial

OpenACC programming for GPGPUs: Rotor wake simulation

Heidi Poxon Cray Inc.

OpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer

PROFILER OPENACC TUTORIAL. Version 2018

CSCS Proposal writing webinar Technical review. 12th April 2015 CSCS

Introduction to OpenACC

Compiling applications for the Cray XC

OpenACC Support in Score-P and Vampir

Introduction to OpenACC. 16 May 2013

Accelerator programming with OpenACC

The PGI Fortran and C99 OpenACC Compilers

Programming NVIDIA GPUs with OpenACC Directives

Portable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.

OpenMP 4.0 implementation in GCC. Jakub Jelínek Consulting Engineer, Platform Tools Engineering, Red Hat

OpenACC 2.6 Proposed Features

The Design and Implementation of OpenMP 4.5 and OpenACC Backends for the RAJA C++ Performance Portability Layer

GPU Programming Paradigms

OpenACC Course. Office Hour #2 Q&A

Programming Environment 4/11/2015

INTRODUCTION TO OPENACC

The Eclipse Parallel Tools Platform

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017

Advanced OpenACC. John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center. Copyright 2016

OPENMP GPU OFFLOAD IN FLANG AND LLVM. Guray Ozen, Simone Atzeni, Michael Wolfe Annemarie Southwell, Gary Klimowicz

Adrian Tate XK6 / openacc workshop Manno, Mar

Portable and Productive Performance with OpenACC Compilers and Tools. Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.

GPU Computing with OpenACC Directives Presented by Bob Crovella For UNC. Authored by Mark Harris NVIDIA Corporation

Lattice Simulations using OpenACC compilers. Pushan Majumdar (Indian Association for the Cultivation of Science, Kolkata)

Introduction to OpenACC

Motivation OpenACC and OpenMP for Accelerators Cray Compilation Environment (CCE) Examples

A case study of performance portability with OpenMP 4.5

OpenACC (Open Accelerators - Introduced in 2012)

KernelGen a toolchain for automatic GPU-centric applications porting. Nicolas Lihogrud Dmitry Mikushin Andrew Adinets

Pragma-based GPU Programming and HMPP Workbench. Scott Grauer-Gray

Productive Performance on the Cray XK System Using OpenACC Compilers and Tools

Introduction to OpenACC. Shaohao Chen Research Computing Services Information Services and Technology Boston University

GPU Computing with OpenACC Directives Dr. Timo Stich Developer Technology Group NVIDIA Corporation

PVF RELEASE NOTES. Version 2017

Mathematical computations with GPUs

Programming paradigms for GPU devices

arxiv: v1 [hep-lat] 12 Nov 2013

An Introduction to OpenACC. Zoran Dabic, Rusell Lutrell, Edik Simonian, Ronil Singh, Shrey Tandel

Parallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer

OPENACC ONLINE COURSE 2018

Compiling environment

OpenACC Fundamentals. Steve Abbott November 15, 2017

OpenACC. Part 2. Ned Nedialkov. McMaster University Canada. CS/SE 4F03 March 2016

GPU Debugging Made Easy. David Lecomber CTO, Allinea Software

INTRODUCTION TO ACCELERATED COMPUTING WITH OPENACC. Jeff Larkin, NVIDIA Developer Technologies

PROGRAMMING MODEL EXAMPLES

GPU Computing with OpenACC Directives

INTRODUCTION TO COMPILER DIRECTIVES WITH OPENACC

Cuda C Programming Guide Appendix C Table C-

Cray Scientific Libraries: Overview and Performance. Cray XE6 Performance Workshop University of Reading Nov 2012

CLAW FORTRAN Compiler source-to-source translation for performance portability

A simple tutorial on generating PTX assembler out of Ada source code using LLVM NVPTX backend

COMPILING FOR THE ARCHER HARDWARE. Slides contributed by Cray and EPCC

Objective. GPU Teaching Kit. OpenACC. To understand the OpenACC programming model. Introduction to OpenACC

PGI Visual Fortran Release Notes. Version The Portland Group

GPUs and Emerging Architectures

Debugging on Blue Waters

Intel Xeon Phi Coprocessor

COMP Parallel Computing. Programming Accelerators using Directives

Intro to Segmentation Fault Handling in Linux. By Khanh Ngo-Duy

S Comparing OpenACC 2.5 and OpenMP 4.5

An Extension of XcalableMP PGAS Lanaguage for Multi-node GPU Clusters

An Introduc+on to OpenACC Part II

OpenACC Fundamentals. Steve Abbott November 13, 2016

Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2

Directive-based Programming for Highly-scalable Nodes

THE FUTURE OF GPU DATA MANAGEMENT. Michael Wolfe, May 9, 2017

PGI Visual Fortran Release Notes. Version The Portland Group

PGI Visual Fortran. Release Notes The Portland Group STMicroelectronics Two Centerpointe Drive Lake Oswego, OR 97035

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

First steps on using an HPC service ARCHER

Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator

An Introduction to OpenACC - Part 1

OPENACC ONLINE COURSE 2018

Parallel Programming. Libraries and implementations

GPU Computing with OpenACC Directives Presented by Bob Crovella. Authored by Mark Harris NVIDIA Corporation

OpenACC Standard. Credits 19/07/ OpenACC, Directives for Accelerators, Nvidia Slideware

Compiler Optimizations. Aniello Esposito HPC Saudi, March 15 th 2016

Steps to create a hybrid code

GPU programming CUDA C. GPU programming,ii. COMP528 Multi-Core Programming. Different ways:

Molecular Simulations using Monte Carlo and Molecular Dynamics

SC13 GPU Technology Theater. Accessing New CUDA Features from CUDA Fortran Brent Leback, Compiler Manager, PGI

OpenMP 4.0/4.5. Mark Bull, EPCC

PGI Visual Fortran. Release Notes The Portland Group STMicroelectronics Two Centerpointe Drive Lake Oswego, OR 97035

MIGRATING TO THE SHARED COMPUTING CLUSTER (SCC) SCV Staff Boston University Scientific Computing and Visualization

Transcription:

OpenACC compiling and performance tips May 3, 2013

OpenACC compiler support Cray Module load PrgEnv-cray craype-accel-nvidia35 Fortran -h acc, noomp # openmp is enabled by default, be careful mixing -fpic -dynamic -rm # include a.lst listing file to show the loop markup -G2 # -g has been observed to break Cray OpenACC code C -h pragma=acc -h nopragma=omp -fpic -dynamic -h msgs # show loop markup in stdout/stderr -Gp # bonus points to the person who synchronizes Cray compiler flags between fortran and c... 2

Cray -rm # loop mark arnoldg@h2ologin2:~/mori/pic2.0-acc-f> ftn -h acc -rm -c push2.f!$acc parallel num_gangs(1) vector_length(3072) ftn-7271 crayftn: WARNING GPUSH2L, File = push2.f, Line = 145 Unsupported OpenACC vector_length expression: Converting 3072 to 1024. arnoldg@h2ologin2:~/mori/pic2.0-acc-f> grep --after-context=5 '!$acc parallel num_gangs(1) vector_length(3072)' push2.lst 145. + G----------<!$acc parallel num_gangs(1) vector_length(3072) ftn-7271 ftn: WARNING File = push2.f, Line = 145 Unsupported OpenACC vector_length expression: Converting 3072 to 1024. 146. G!!$acc kernels 147. G!!data copy(part),copyin(fxy),create(nn,mm,dxp,dyp,np,mp,dx,dy,vx,vy) arnoldg@h2ologin2:~/mori/pic2.0-acc-f> grep 'line 145 ' push2.lst A region starting at line 145 and ending at line 240 was placed on the accelerator. arnoldg@h2ologin2:~/mori/pic2.0-acc-f>

OpenACC compiler support PGI Module load PrgEnv-pgi cudatoolkit Cudatoolkit is required, PGI is creating CUDA code as intermediate -ta=nvidia,keepgpu,keepptx Fortran, C # nice -acc -ta=nvidia -mcmodel=medium -Minfo=accel GNU Don't touch that dial! 4

PGI -Minfo=accel arnoldg@h2ologin2:~/mori/pic2.0-acc-f> ftn -acc -ta=nvidia -Minfo=accel -c push2.f gpush2l: 145, Accelerator kernel generated 145, CC 1.3 : 18 registers; 112 shared, 32 constant, 0 local memory bytes CC 2.0 : 26 registers; 0 shared, 132 constant, 0 local memory bytes 148,!$acc loop vector(3072)! blockidx%x threadidx%x 169, Sum reduction generated for sum1 145, Generating present_or_copy(part(:4,:nop)) Generating present_or_copyin(fxy(:,:,:)) Generating compute capability 1.3 binary Generating compute capability 2.0 binary 148, Loop is parallelizable

05/03/13 May 3, 2013 OpenACC compiling and performance tips

05/03/13 OpenACC compiler support Cray Module load PrgEnv-cray craype-accel-nvidia35 Fortran -h acc, noomp # openmp is enabled by default, be careful mixing -fpic -dynamic -rm # include a.lst listing file to show the loop markup -G2 # -g has been observed to break Cray OpenACC code C -h pragma=acc -h nopragma=omp -fpic -dynamic -h msgs # show loop markup in stdout/stderr -Gp # bonus points to the person who synchronizes Cray compiler flags between fortran and c... 2 Watch out for the differences in flags when using the Cray compiler environment. Omitting craype-accel-nvidia35 can result in silent failure mode: ftn -h acc -c myfile.c ldd a.out not a dynamic executable...you've made an executable that will not use the accelerator at runtime, but the code will probably run.

05/03/13 Cray -rm # loop mark arnoldg@h2ologin2:~/mori/pic2.0-acc-f> ftn -h acc -rm -c push2.f!$acc parallel num_gangs(1) vector_length(3072) ftn-7271 crayftn: WARNING GPUSH2L, File = push2.f, Line = 145 Unsupported OpenACC vector_length expression: Converting 3072 to 1024. arnoldg@h2ologin2:~/mori/pic2.0-acc-f> grep --after-context=5 '!$acc parallel num_gangs(1) vector_length(3072)' push2.lst 145. + G----------<!$acc parallel num_gangs(1) vector_length(3072) ftn-7271 ftn: WARNING File = push2.f, Line = 145 Unsupported OpenACC vector_length expression: Converting 3072 to 1024. 146. G!!$acc kernels 147. G!!data copy(part),copyin(fxy),create(nn,mm,dxp,dyp,np,mp,dx,dy,vx,vy) arnoldg@h2ologin2:~/mori/pic2.0-acc-f> grep 'line 145 ' push2.lst A region starting at line 145 and ending at line 240 was placed on the accelerator. arnoldg@h2ologin2:~/mori/pic2.0-acc-f> Cray loop mark.lst listings are a good source of information about how the compiler optimized (or could not optimize) your code. Along with basic profiling info, this is a good starting point for code optimization.

05/03/13 OpenACC compiler support PGI Module load PrgEnv-pgi cudatoolkit Cudatoolkit is required, PGI is creating CUDA code as intermediate -ta=nvidia,keepgpu,keepptx Fortran, C # nice -acc -ta=nvidia -mcmodel=medium -Minfo=accel GNU Don't touch that dial! 4 Omitting the cudatoolkit module will result in silent failure mode (program compiles and links with possibly just a warning): /opt/pgi/12.10.0/linux86-64/12.10/lib/libacc1mp.a(nvinitx.o): In function ` pgi_cu_init_y': /usr/pgrel/extract/x86/2012/rte/accel/hammer/lib-linux86-64mp/../srcnv/nvinitx.c:238: warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking Keepgpu and keepptx will preserve the intermediate cuda and/or Nvidia ptx assembly files. This is in contrast to the Cray compiler that compiles directly to ptx assembly and creates no intermediate cuda source. mcmodel=medium allows for larger static memory ( > 2 GB ) GNU may pickup OpenACC support once it becomes part of a future OpenMP standard.

05/03/13 PGI -Minfo=accel arnoldg@h2ologin2:~/mori/pic2.0-acc-f> ftn -acc -ta=nvidia -Minfo=accel -c push2.f gpush2l: 145, Accelerator kernel generated 145, CC 1.3 : 18 registers; 112 shared, 32 constant, 0 local memory bytes CC 2.0 : 26 registers; 0 shared, 132 constant, 0 local memory bytes 148,!$acc loop vector(3072)! blockidx%x threadidx%x 169, Sum reduction generated for sum1 145, Generating present_or_copy(part(:4,:nop)) Generating present_or_copyin(fxy(:,:,:)) Generating compute capability 1.3 binary Generating compute capability 2.0 binary 148, Loop is parallelizable Similar to the Cray loopmark report, PGI compilers can emit detailed information about optimizations. -Minfo=accel will add comments for each OpenACC directive found and what was done (or not done) with that code region.