PROFILER OPENACC TUTORIAL Version 2018

TABLE OF CONTENTS

Chapter 1. Tutorial Setup
Chapter 2. Profiling the application
Chapter 3. Adding OpenACC directives
Chapter 4. Improving Data Transfer Speed
Chapter 5. What more can be done?

LIST OF TABLES

Table 1. Performance Results


Chapter 1. TUTORIAL SETUP

This tutorial assumes you have installed and configured the PGI 2018 compilers with the NVIDIA CUDA toolkit included. The tutorial uses the Cactus BenchADM benchmark (see http://cactuscode.org/download/). Note that BenchADM uses double-precision arithmetic.

First, download the tutorial source file package, http://www.pgroup.com/lit/samples/benchadm_nvprof.tar.gz, from the PGI website and unpack the source in a local directory. Next, bring up a shell command window (we use csh), cd to the directory PGI_Acc_benchADM, and ensure your environment is initialized for use of the PGI 2018 compilers:

    % which pgfortran
    /opt/pgi/linux86-64/2018/bin/pgfortran

If your environment is not yet initialized, execute the following commands to do so:

    % setenv PGI /opt/pgi
    % set path=($PGI/linux86-64/18.10/bin $path)
    % setenv LM_LICENSE_FILE $LM_LICENSE_FILE:/opt/pgi/license.dat

You may have to ask your system administrator for the pathname to the FlexNet license file containing the PGI 2018 license keys.

Note: the information for this tutorial was gathered on a workstation with an Intel Core i7-3930K and an NVIDIA Tesla K40c. All examples were compiled with PGI version 16.5; your results may vary.
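Before going further, it is worth confirming that the compiler can actually see a GPU. The following is a minimal sketch, not part of the tutorial package: a tiny free-form Fortran program (the file name check_gpu.f90 is our own) that uses the standard OpenACC runtime API to count the visible NVIDIA devices. Build it with "pgfortran -ta=tesla check_gpu.f90".

    ! check_gpu.f90 -- minimal sketch, not part of the BenchADM sources.
    ! Uses the standard OpenACC runtime API to count visible NVIDIA GPUs.
    program check_gpu
      use openacc
      implicit none
      integer :: ndev
      ndev = acc_get_num_devices(acc_device_nvidia)
      if (ndev > 0) then
        print *, 'NVIDIA devices found: ', ndev
      else
        print *, 'No NVIDIA device found; check your CUDA installation.'
      end if
    end program check_gpu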

Chapter 2. PROFILING THE APPLICATION

The first step to working effectively with the PGI Accelerator compilers is to understand which parts of your code can benefit from GPU acceleration. The most efficient way to determine this is to profile your code using the PGI profiler. First, build BenchADM with the option to enable the Common Compiler Feedback Format (CCFF) and profile the code using the profiler:

    % make OPT="-fast -Minfo=ccff" build_base
    % pgprof --cpu-profiling-mode flat ./bin/benchadm_base BenchADM_200l.par
    ======== CPU profiling result (flat):
    Time(%)  Time    Name
    99.72%   118.9s  bench_staggeredleapfrog2_
     0.15%   180ms   munmap
     0.07%   80ms    PUGH_ReductionMinVal
     0.06%   70ms    PUGH_ReductionMaxVal

Two parameter files have been provided. BenchADM_200l.par sets up a large simulation which we will use to measure the application's performance. A much smaller parameter file, BenchADM_20l.par, has also been provided for quick validation of the results and of the effects of running the application on the GPU.

BenchADM has a fairly simple profile, with nearly all of the time spent in a single routine called "bench_staggeredleapfrog2". OpenACC is designed to accelerate loops in your code, so let's drill down further and find the particular source lines in "bench_staggeredleapfrog2" that account for most of the time spent in this routine (notice the addition of the option '--cpu-profiling-scope instruction'):

    % pgprof --cpu-profiling-scope instruction --cpu-profiling-mode flat ./bin/benchadm_base BenchADM_200l.par
    ======== CPU profiling result (flat):
    Time(%)  Time   Name
    1.40%    1.67s  bench_staggeredleapfrog2_ (./src/staggeredleapfrog2.f:341 0x8880)
    1.22%    1.46s  bench_staggeredleapfrog2_ (./src/staggeredleapfrog2.f:341 0x87e6)
    1.14%    1.36s  bench_staggeredleapfrog2_ (./src/staggeredleapfrog2.f:341 0x88c7)
    1.02%    1.22s  bench_staggeredleapfrog2_ (./src/staggeredleapfrog2.f:341 0x8722)
    0.94%    1.12s  bench_staggeredleapfrog2_ (./src/staggeredleapfrog2.f:341 0x874a)
    0.88%    1.05s  bench_staggeredleapfrog2_ (./src/staggeredleapfrog2.f:341 0x8833)
    0.86%    1.03s  bench_staggeredleapfrog2_ (./src/staggeredleapfrog2.f:374 0xc661)

Line 341 is part of a loop beginning at line 338. Given the high percentage of time spent in this loop, it is a good candidate for GPU acceleration.
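BenchADM's actual source is not reproduced in this tutorial, but the shape of the hot spot is worth picturing. The following fragment is a hypothetical, free-form sketch of the kind of triple-nested stencil sweep the profiler is pointing at; every name and bound in it is a placeholder:

    ! Hypothetical sketch only -- not the actual BenchADM source.
    ! A time-dominant loop nest like the one at staggeredleapfrog2.f:338
    ! typically sweeps the interior of a 3-D grid, performing many
    ! floating-point updates per grid point.
    subroutine leapfrog_sketch(nx, ny, nz, k_stag, alp)
      implicit none
      integer, intent(in) :: nx, ny, nz
      double precision, intent(inout) :: k_stag(nx,ny,nz)
      double precision, intent(in)    :: alp(nx,ny,nz)
      integer :: i, j, k
      do k = 2, nz-1
        do j = 2, ny-1
          do i = 2, nx-1
            k_stag(i,j,k) = k_stag(i,j,k) + 0.5d0 * &
              (alp(i+1,j,k) - 2.0d0*alp(i,j,k) + alp(i-1,j,k))
          end do
        end do
      end do
    end subroutine leapfrog_sketch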

Chapter 3. ADDING OPENACC DIRECTIVES

Now that we have identified a likely candidate loop for acceleration and created a performance baseline, let's add our first OpenACC directives to the code. Open the file "src/staggeredleapfrog2.f" in a text editor. Insert "!$ACC KERNELS" on a new line right above line 338, then add a closing directive "!$ACC END KERNELS" on a new line right below line 815 (the ENDDO corresponding to the DO loop at line 366). Write the modified code out into a file named "src/staggeredleapfrog2_acc1.f". Re-build the code using the command:

    % make OPT="-fast -ta=tesla -Minfo=accel" build_acc1

The "-ta=tesla" target accelerator option instructs the compiler to interpret OpenACC directives and to try to offload accelerator regions to an NVIDIA GPU. The "-Minfo=accel" option instructs the compiler to emit CCFF messages to stdout during compilation. The messages indicate whether or not an accelerator region is generated, list which arrays are copied between the host and the GPU accelerator, and report the schedule (parallel and/or vector) used to map the kernel onto the GPU accelerator processors. Scrolling back in your command window, you should see output from the PGI compiler that looks something like the following:

    pgfortran -fast -ta=tesla -Minfo=accel -c -o objdir/StaggeredLeapfrog2_acc1.o ./src/StaggeredLeapfrog2_acc1.F
    bench_staggeredleapfrog2:
        366, Generating copyout(adm_kzz_stag(2:nx-1,2:ny-1,2:nz-1)...)
             Generating copyin(lalp(:nx,:ny,:nz))
             Generating copyout(adm_kxx_stag(2:nx-1,2:ny-1,2:nz-1))
             Generating copyin(lgzz(:nx,:ny,:nz),lgyz(:nx,:ny,:nz)...)
        367, Loop is parallelizable
        371, Loop is parallelizable
        375, Loop is parallelizable
             Accelerator kernel generated
             Generating Tesla code
            367, !$acc loop gang             ! blockidx%y
            371, !$acc loop gang, vector(4)  ! blockidx%z threadidx%y
            375, !$acc loop gang, vector(32) ! blockidx%x threadidx%x

Voila! You've succeeded in mapping the loop starting at line 367 onto a GPU accelerator. If you don't see such output, you may have made an error editing the source. You will find versions pre-edited with the directives and known to work in the file "src/steps/StaggeredLeapfrog2_acc1.F". Feel free to compare to that version, or simply copy it to "src/staggeredleapfrog2_acc1.f" and try again. There are similar pre-edited versions in "src/steps" for each step of this tutorial.
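Applied to the hypothetical sketch from Chapter 2, the directive pairing looks as follows. This illustrates the technique, not the actual BenchADM edit; all names and bounds remain placeholders:

    ! The same hypothetical loop nest, now bracketed by an OpenACC
    ! kernels region. The compiler analyzes the enclosed loops and
    ! generates GPU kernels plus the implied data movement.
    subroutine leapfrog_sketch_acc(nx, ny, nz, k_stag, alp)
      implicit none
      integer, intent(in) :: nx, ny, nz
      double precision, intent(inout) :: k_stag(nx,ny,nz)
      double precision, intent(in)    :: alp(nx,ny,nz)
      integer :: i, j, k
    !$ACC KERNELS
      do k = 2, nz-1
        do j = 2, ny-1
          do i = 2, nx-1
            k_stag(i,j,k) = k_stag(i,j,k) + 0.5d0 * &
              (alp(i+1,j,k) - 2.0d0*alp(i,j,k) + alp(i-1,j,k))
          end do
        end do
      end do
    !$ACC END KERNELS
    end subroutine leapfrog_sketch_acc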

Since this is our first attempt to accelerate this code on the GPU, let's just profile a sanity run using the following command:

    % pgprof --cpu-profiling off ./bin/benchadm_acc1 BenchADM_20l.par

The timing data produced by the PGI profiler now shows information about the operations executed on the GPU. Both the kernel time (about 50ms) and the data transfer times (445ms down / 419ms up) are included.

    ==23973== Profiling result:
    Time(%)  Time      Calls   Avg       Min       Max       Name
    48.69%   445.31ms  199500  2.2320us  1.9890us  273.48us  [CUDA memcpy HtoD]
    45.82%   419.04ms  195700  2.1410us  1.8880us  186.04us  [CUDA memcpy DtoH]
     5.04%   46.073ms     100  460.73us  457.03us  464.10us  bench_staggeredleapfrog2_375_gpu
     0.45%   4.0812ms     100  40.811us  39.456us  42.112us  bench_staggeredleapfrog2_342_gpu

Before we continue, let us address the high number of data transfers and the time it takes to move this data between the CPU and GPU. If these numbers scale, and we continue to see roughly 20 times as much time spent moving data between the host and device as running kernels, we are not going to see any performance gain from accelerating this loop on the GPU. The next chapter will begin by adding directives to move data explicitly.

The GPU accelerator card needs to be initialized before use. This is done by default the first time an accelerator region is entered. Alternatively, you can add a call to the function "acc_init()" to initialize the device. We have added the following code to BenchADM's initialization routine in the file "src/cactus/initialisecactus_acc.c":

    #ifdef _OPENACC
    # include "accel.h"
    #endif
    ...
    #ifdef _OPENACC
        acc_init(acc_device_nvidia);
    #endif

The _OPENACC macro name is defined when OpenACC directives are enabled. This is an example of a portable method for factoring GPU initialization out of accelerator regions, ensuring the timings for the regions do not include initialization overhead.
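BenchADM performs this initialization from C, but the same call is available from Fortran through the OpenACC runtime API. The following is a minimal sketch of a Fortran equivalent (the subroutine name is ours, and the source must be preprocessed, e.g. a .F90 file, for the #ifdef guard to take effect):

    ! Minimal sketch of device initialization from Fortran, assuming the
    ! standard OpenACC runtime API module. The _OPENACC guard keeps the
    ! code portable to builds without OpenACC enabled.
    subroutine init_accelerator()
    #ifdef _OPENACC
      use openacc
      call acc_init(acc_device_nvidia)
    #endif
    end subroutine init_accelerator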

Chapter 4. IMPROVING DATA TRANSFER SPEED

In order to minimize data transfers between the host memory and the GPU accelerator device memory, the PGI Accelerator compilers compute the minimal amount of data that must be copied. The following -Minfo message shows the compiler has determined that only a section of the array "adm_kzz_stag_p_p" must be copied to the GPU:

    Generating copyin(...,adm_kzz_stag_p_p(2:nx-2+1,2:ny-2+1,2:nz-2+1),...)

However, this array section is not contiguous; because the data transfer is done via a DMA (direct memory access) operation, the transfer will send each contiguous subarray separately. It is often more efficient to copy an entire array: even though that involves transferring more data, it can be done in one large DMA operation.

Copy "src/staggeredleapfrog2_acc1.f" to "src/staggeredleapfrog2_acc2.f" and bring up the latter in a text editor. Modify the accelerator region directive by inserting data copy clauses that will cause whole arrays to be copied, rather than array sections. Hint: look at the -Minfo messages output by the compiler to determine which arrays are copied in/out, and insert directives that look as follows to explicitly specify the whole-array copies:

    !$ACC KERNELS
    !$ACC& COPYOUT (
    !$ACC&   ADM_kxx_stag(1:nx,1:ny,1:nz),
    !$ACC&   ADM_kxy_stag(1:nx,1:ny,1:nz),
    ...

Alternatively, just copy the file "src/steps/StaggeredLeapfrog2_acc2.f" to the "src" directory. Build and profile the revised code using the following commands:

    % make OPT="-fast -ta=tesla -Minfo=accel" build_acc2
    % pgprof ./bin/benchadm_acc2 BenchADM_200l.par
    ...
    ==24179== Profiling result:
    Time(%)  Time      Calls  Avg       Min       Max       Name
    60.82%   23.6223s    100  236.22ms  235.14ms  236.50ms  bench_staggeredleapfrog2_406_gpu
    29.77%   11.5609s   9400  1.2299ms  2.1120us  1.6457ms  [CUDA memcpy HtoD]
     9.41%   3.65443s   2400  1.5227ms  1.2963ms  1.6557ms  [CUDA memcpy DtoH]
    ...

The revised code with directives helps the compiler optimize data movement and reduces the host-to-GPU data transfer time significantly. Moving the data between the CPU and GPU can now be completed in about 12 seconds.
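To make the whole-array technique concrete outside of BenchADM, here is the hypothetical sketch again with explicit data clauses. One wrinkle worth noting: an array that is read as well as written, or only partially written, belongs in a COPY clause rather than COPYOUT, because COPYOUT would bring back undefined values for elements the kernels never wrote.

    ! Hypothetical sketch: whole-array data clauses on a kernels region.
    ! COPYIN transfers host data to the GPU at region entry; COPY also
    ! transfers results back at exit. Copying full arrays turns many
    ! small, strided DMA transfers into one large contiguous transfer.
    subroutine leapfrog_sketch_acc2(nx, ny, nz, k_stag, alp)
      implicit none
      integer, intent(in) :: nx, ny, nz
      double precision, intent(inout) :: k_stag(nx,ny,nz)
      double precision, intent(in)    :: alp(nx,ny,nz)
      integer :: i, j, k
    !$ACC KERNELS COPYIN(alp(1:nx,1:ny,1:nz)) &
    !$ACC COPY(k_stag(1:nx,1:ny,1:nz))
      do k = 2, nz-1
        do j = 2, ny-1
          do i = 2, nx-1
            k_stag(i,j,k) = k_stag(i,j,k) + 0.5d0 * &
              (alp(i+1,j,k) - 2.0d0*alp(i,j,k) + alp(i-1,j,k))
          end do
        end do
      end do
    !$ACC END KERNELS
    end subroutine leapfrog_sketch_acc2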

Chapter 5. WHAT MORE CAN BE DONE?

In the final example file, "src/staggeredleapfrog2_acc4.f", we can improve BenchADM further by expanding the accelerator region to include the entire function. Although the first three loops have low computational intensity and would not be good candidates for acceleration by themselves, most of the data transfer cost has already been paid, so it is beneficial to offload them to the GPU as well.

Try copying "src/staggeredleapfrog2_acc3.f" to "src/staggeredleapfrog2_acc4.f" and bring up the latter in your editor. Move the !$ACC KERNELS directive to encompass all four loops. Which arrays must be added to the COPYIN or COPYOUT clauses? Which arrays can be allocated locally using the LOCAL clause? Feel free to take a look at "src/steps/StaggeredLeapfrog2_acc4.F" for hints. Note that the LOCAL clause on several of the arrays used only within the accelerator region eliminates the need to copy them at all, but some arrays must be added as inputs or outputs to the first three loops in the routine.

With these changes, the data transfer and aggregate kernel execution times increase slightly, but the overall wall-clock time drops.

    % make OPT="-fast -ta=tesla -Minfo=accel" build_acc4
    % pgprof ./bin/benchadm_acc4 BenchADM_200l.par
    ...
    ==24586== Profiling result:
    Time(%)  Time      Calls  Avg       Min       Max       Name
    48.11%   26.7256s    100  267.26ms  265.94ms  268.89ms  bench_staggeredleapfrog2_410_gpu
    35.09%   19.4933s  15900  1.2260ms  2.1440us  1.6592ms  [CUDA memcpy HtoD]
    13.16%   7.31232s   4800  1.5234ms  1.2969ms  1.6274ms  [CUDA memcpy DtoH]
     3.60%   1.99779s    100  19.978ms  19.917ms  20.043ms  bench_staggeredleapfrog2_374_gpu
     0.02%   11.469ms    100  114.69us  113.54us  120.67us  bench_staggeredleapfrog2_334_gpu
     0.02%   11.439ms    100  114.39us  113.63us  115.20us  bench_staggeredleapfrog2_353_gpu
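The LOCAL clause referred to above comes from the PGI Accelerator programming model; standard OpenACC spells the same idea CREATE (allocate on the device, transfer nothing in either direction). Here is a minimal hypothetical sketch of a region spanning multiple loop nests with a device-local temporary, again with placeholder names:

    ! Hypothetical sketch: one kernels region spanning multiple loops.
    ! The scratch array tmp lives only on the device: CREATE (the
    ! standard OpenACC spelling of the tutorial's LOCAL clause)
    ! allocates it on the GPU with no host<->device transfer.
    subroutine leapfrog_sketch_acc4(nx, ny, nz, k_stag, alp)
      implicit none
      integer, intent(in) :: nx, ny, nz
      double precision, intent(inout) :: k_stag(nx,ny,nz)
      double precision, intent(in)    :: alp(nx,ny,nz)
      double precision :: tmp(nx,ny,nz)
      integer :: i, j, k
    !$ACC KERNELS COPYIN(alp) COPY(k_stag) CREATE(tmp)
      ! First loop nest: build the temporary on the device.
      do k = 1, nz
        do j = 1, ny
          do i = 1, nx
            tmp(i,j,k) = 2.0d0 * alp(i,j,k)
          end do
        end do
      end do
      ! Second loop nest: consume it; tmp is never copied to the host.
      do k = 2, nz-1
        do j = 2, ny-1
          do i = 2, nx-1
            k_stag(i,j,k) = k_stag(i,j,k) + 0.25d0 * &
              (tmp(i+1,j,k) - tmp(i-1,j,k))
          end do
        end do
      end do
    !$ACC END KERNELS
    end subroutine leapfrog_sketch_acc4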

The following table summarizes the performance at each step of this tutorial.

Table 1. Performance Results

    Step                                      Total Execution (s)  Kernel Execution (s)  Data Movement (s)
    CPU Only (benchadm_base)                  109                  108                   0
    Explicit Data Transfers (benchadm_acc2)   85                   23                    16

Notice

ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks

NVIDIA, the NVIDIA logo, Cluster Development Kit, PGC++, PGCC, PGDBG, PGF77, PGF90, PGF95, PGFORTRAN, PGHPF, PGI, PGI Accelerator, PGI CDK, PGI Server, PGI Unified Binary, PGI Visual Fortran, PGI Workstation, PGPROF, PGROUP, PVF, and The Portland Group are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright 2018 NVIDIA Corporation. All rights reserved.

PGI Compilers and Tools