OPENACC ONLINE COURSE 2018 Week 3 Loop Optimizations with OpenACC Jeff Larkin, Senior DevTech Software Engineer, NVIDIA

ABOUT THIS COURSE
A three-part introduction to OpenACC:
Week 1: Introduction to OpenACC
Week 2: Data Management with OpenACC
Week 3: Loop Optimizations with OpenACC
Each week has a corresponding lab; only an hour and a web browser are required. Please ask questions in the Q&A box and our TAs will answer as quickly as possible.

COURSE OBJECTIVE Enable YOU to accelerate YOUR applications with OpenACC.

WEEK 3 OUTLINE
Topics to be covered:
Gangs, Workers, and Vectors Demystified
GPU Profiles
Loop Optimizations
Week 3 Lab
Where to Get Help

WEEKS 1 & 2 REVIEW

OPENACC DEVELOPMENT CYCLE
Analyze your code to determine the most likely places needing parallelization or optimization.
Parallelize your code, starting with the most time-consuming parts, and check for correctness.
Optimize your code to improve the observed speed-up from parallelization.
The cycle then repeats: Analyze, Parallelize, Optimize.

OpenACC Directives

#pragma acc data copyin(a,b) copyout(c)    // Manage data movement
{
  ...
  #pragma acc parallel                     // Initiate parallel execution
  {
    #pragma acc loop gang vector           // Optimize loop mappings
    for (i = 0; i < n; ++i) {
      c[i] = a[i] + b[i];
    }
  }
  ...
}

Incremental. Single source. Interoperable. Performance portable across CPU, GPU, and manycore.

PARALLELIZE WITH OPENACC PARALLEL LOOP

while ( err > tol && iter < iter_max ) {
  err = 0.0;
  #pragma acc parallel loop reduction(max:err)
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }
  #pragma acc parallel loop
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }
  iter++;
}

Parallelize the first loop nest; a max reduction is required. Parallelize the second loop nest. We didn't detail how to parallelize the loops, just which loops to parallelize.

OPTIMIZED DATA MOVEMENT

#pragma acc data copy(A[:n*m]) copyin(Anew[:n*m])
while ( err > tol && iter < iter_max ) {
  err = 0.0;
  #pragma acc parallel loop reduction(max:err) copyin(A[0:n*m]) copy(Anew[0:n*m])
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }
  #pragma acc parallel loop copyin(Anew[0:n*m]) copyout(A[0:n*m])
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }
  iter++;
}

The data region copies A to/from the accelerator only when needed, and copies in the initial condition of Anew but not its final value.

OPENACC SPEED-UP
Speed-up relative to the serial version:
SERIAL: 1.00X
MULTICORE: 3.13X
NVIDIA TESLA K80: 10.18X
NVIDIA TESLA V100 (MANAGED): 37.14X
NVIDIA TESLA V100 (DATA): 38.81X
PGI 18.7, NVIDIA Tesla V100, Intel i9-7900X CPU @ 3.30GHz

GANGS, WORKERS, AND VECTORS DEMYSTIFIED

GANGS, WORKERS, AND VECTORS DEMYSTIFIED How much work one worker can do is limited by his speed. A single worker can only move so fast.

GANGS, WORKERS, AND VECTORS DEMYSTIFIED Even if we increase the size of his roller, he can only paint so fast. We need more workers!

GANGS, WORKERS, AND VECTORS DEMYSTIFIED Multiple workers can do more work and share resources, if organized properly.

GANGS, WORKERS, AND VECTORS DEMYSTIFIED By organizing our workers into groups (gangs), they can effectively work together within a floor. Groups (gangs) on different floors can operate independently. Since gangs operate independently, we can use as many or few as we need.

GANGS, WORKERS, AND VECTORS DEMYSTIFIED Even if there aren't enough gangs for every floor, a gang can move to another floor when it is ready.

GANGS, WORKERS, AND VECTORS DEMYSTIFIED Our painter is like an OpenACC worker: he can only do so much. His roller is like a vector: he can move faster by covering more wall at once. Eventually we need more workers, which can be organized into gangs to get more done.

GPU PROFILES

PROFILING GPU CODE (PGPROF) Using PGPROF to profile GPU code: PGPROF presents far more information when running on a GPU. We can view CPU Details, GPU Details, a Timeline, and even perform Analysis of the performance.

PROFILING GPU CODE (PGPROF)
Using PGPROF to profile GPU code:
MemCpy(HtoD): data transfers from the Host to the Device (CPU to GPU)
MemCpy(DtoH): data transfers from the Device to the Host (GPU to CPU)
Compute: our computational functions; here we can see our calcnext and swap functions
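As a concrete illustration (not part of the original slides), one common workflow with the PGI 18.7 tools is to build with accelerator feedback enabled and then run the executable under PGPROF; the file name laplace.c is a placeholder for the lab code:

pgcc -fast -ta=tesla -Minfo=accel laplace.c -o laplace   # -Minfo=accel prints what the compiler parallelized
pgprof ./laplace                                         # collect and print a GPU profile summary
pgprof                                                   # or launch the GUI and create a session for ./laplace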

PROFILING GPU CODE Managed Memory: CPU page faults trigger Device-to-Host migrations. GPU page faults trigger Host-to-Device migrations. These may be hints that more of the code needs to be parallelized and/or that data movement needs to be optimized.
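For reference (this is not spelled out on the slide): with the PGI compiler, the managed-memory behavior described above comes from building with the managed sub-option, so the same source can be compared in both modes; laplace.c is again a placeholder name:

pgcc -fast -ta=tesla:managed -Minfo=accel laplace.c -o laplace_managed   # CUDA Unified Memory; page faults drive migrations
pgcc -fast -ta=tesla -Minfo=accel laplace.c -o laplace_data              # explicit data directives control movement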

LOOP OPTIMIZATIONS

OPENACC LOOP DIRECTIVE
Expressing parallelism:
Marks a single loop for parallelization
Allows the programmer to give additional information and/or optimizations about the loop
Provides many different ways to describe the type of parallelism to apply to the loop
Must be contained within an OpenACC compute region (either a kernels or a parallel region) to parallelize loops

C/C++:
#pragma acc loop
for(int i = 0; i < N; i++)
  // Do something

Fortran:
!$acc loop
do i = 1, N
  ! Do something
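To make the compute-region requirement concrete, here is a minimal sketch (not from the slides) pairing the loop directive with a parallel region; the names x, y, a, and N are illustrative:

// The loop directive only parallelizes work when it sits inside a compute region.
#pragma acc parallel
{
  #pragma acc loop
  for (int i = 0; i < N; i++)
    y[i] = a * x[i] + y[i];   // simple SAXPY-style body
}
// Equivalently, the combined directive: #pragma acc parallel loop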

COLLAPSE CLAUSE
collapse( N )
Combine the next N tightly nested loops
Can turn a multidimensional loop nest into a single-dimension loop
This can be extremely useful for increasing memory locality, as well as creating larger loops to expose more parallelism

#pragma acc parallel loop collapse(2)
for( i = 0; i < size; i++ )
  for( j = 0; j < size; j++ )
  {
    double tmp = 0.0f;
    #pragma acc loop reduction(+:tmp)
    for( k = 0; k < size; k++ )
      tmp += a[i][k] * b[k][j];
    c[i][j] = tmp;
  }

COLLAPSE CLAUSE
collapse( 2 ): the 4x4 iteration space, (0,0) through (3,3), is collapsed into a single loop of 16 iterations.

#pragma acc parallel loop collapse(2)
for( i = 0; i < 4; i++ )
  for( j = 0; j < 4; j++ )
    array[i][j] = 0.0f;

COLLAPSE CLAUSE
When/why to use it:
A single loop might not have enough iterations to parallelize (see the sketch below)
Collapsing outer loops gives more scalable (gang) parallelism
Collapsing inner loops gives tighter (vector) parallelism
Collapsing all loops gives the compiler total freedom, but may cost data locality
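A hedged illustration of the first point (not from the slides): if the outer loop has only a handful of iterations, a GPU cannot fill its gangs from it alone, but collapsing it with the inner loop exposes the full product of iterations to distribute. The array name and trip counts here are made up:

// Outer loop alone: only 8 iterations, far too few to occupy a GPU.
// Collapsed: 8 * 100000 independent iterations to spread across gangs and vectors.
#pragma acc parallel loop collapse(2)
for (int row = 0; row < 8; row++)
  for (int col = 0; col < 100000; col++)
    grid[row][col] = 0.0;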

COLLAPSE CLAUSE

#pragma acc data copy(A[:n*m]) copyin(Anew[:n*m])
while ( err > tol && iter < iter_max ) {
  err = 0.0;
  #pragma acc parallel loop reduction(max:err) collapse(2) \
    copyin(A[0:n*m]) copy(Anew[0:n*m])
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }
  #pragma acc parallel loop collapse(2) \
    copyin(Anew[0:n*m]) copyout(A[0:n*m])
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }
  iter++;
}

Collapse the two loops into one for more flexibility in parallelizing.

OPENACC SPEED-UP
Speed-up relative to the optimized-data version:
NVIDIA TESLA V100 (DATA): 1.00X
NVIDIA TESLA V100 (COLLAPSE): 1.03X
PGI 18.7, NVIDIA Tesla V100, Intel i9-7900X CPU @ 3.30GHz

TILE CLAUSE
tile( x, y, z, ... )
Breaks multidimensional loops into tiles or blocks
Can increase data locality in some codes
Will be able to execute multiple tiles simultaneously

#pragma acc kernels loop tile(32, 32)
for( i = 0; i < size; i++ )
  for( j = 0; j < size; j++ )
    for( k = 0; k < size; k++ )
      c[i][j] += a[i][k] * b[k][j];

TILE CLAUSE
tile( 2, 2 ): the 4x4 iteration space, (0,0) through (3,3), is broken into four 2x2 tiles.

#pragma acc kernels loop tile(2,2)
for(int x = 0; x < 4; x++){
  for(int y = 0; y < 4; y++){
    array[x][y]++;
  }
}

OPTIMIZED DATA MOVEMENT

#pragma acc data copy(A[:n*m]) copyin(Anew[:n*m])
while ( err > tol && iter < iter_max ) {
  err = 0.0;
  #pragma acc parallel loop reduction(max:err) tile(32,32) \
    copyin(A[0:n*m]) copy(Anew[0:n*m])
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }
  #pragma acc parallel loop tile(32,32) \
    copyin(Anew[0:n*m]) copyout(A[0:n*m])
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }
  iter++;
}

Create 32x32 tiles of the loops to better exploit data locality.

TILING RESULTS (V100)
The tile clause often requires an exhaustive search of options. For our example code, the CPU saw no benefit from tiling, while the GPU saw anywhere from a 23% loss of performance to a 10% improvement.

Tile       CPU Improvement   GPU Improvement
Baseline   1.00X             1.00X
4x4        1.00X             0.77X
4x8        1.00X             0.92X
8x4        1.00X             0.95X
8x8        1.00X             1.02X
8x16       1.00X             0.99X
16x8       1.00X             1.02X
16x16      1.00X             1.01X
16x32      1.00X             1.06X
32x16      1.00X             1.07X
32x32      1.00X             1.10X

OPENACC SPEED-UP
Speed-up relative to the optimized-data version:
NVIDIA TESLA V100 (DATA): 1.00X
NVIDIA TESLA V100 (COLLAPSE): 1.03X
NVIDIA TESLA V100 (TILE): 1.13X
PGI 18.7, NVIDIA Tesla V100, Intel i9-7900X CPU @ 3.30GHz

GANG, WORKER, AND VECTOR CLAUSES
The developer can instruct the compiler which levels of parallelism to use on given loops by adding clauses:
gang: mark this loop for gang parallelism
worker: mark this loop for worker parallelism
vector: mark this loop for vector parallelism
These can be combined on the same loop.

#pragma acc parallel loop gang
for( i = 0; i < size; i++ )
  #pragma acc loop worker
  for( j = 0; j < size; j++ )
    #pragma acc loop vector
    for( k = 0; k < size; k++ )
      c[i][j] += a[i][k] * b[k][j];

#pragma acc parallel loop \
  collapse(3) gang vector
for( i = 0; i < size; i++ )
  for( j = 0; j < size; j++ )
    for( k = 0; k < size; k++ )
      c[i][j] += a[i][k] * b[k][j];

SEQ CLAUSE
The seq clause (short for sequential) will tell the compiler to run the loop sequentially.
In the sample code, the compiler will parallelize the outer loops across the parallel threads, but each thread will run the inner-most loop sequentially.
The compiler may automatically apply the seq clause to loops as well.

#pragma acc parallel loop
for( i = 0; i < size; i++ )
  #pragma acc loop
  for( j = 0; j < size; j++ )
    #pragma acc loop seq
    for( k = 0; k < size; k++ )
      c[i][j] += a[i][k] * b[k][j];

ADJUSTING GANGS, WORKERS, AND VECTORS
The compiler will choose a number of gangs, workers, and a vector length for you, but you can change it with clauses:
num_gangs(N): generate N gangs for this parallel region
num_workers(M): generate M workers for this parallel region
vector_length(Q): use a vector length of Q for this parallel region

#pragma acc parallel num_gangs(2) \
  num_workers(2) vector_length(32)
{
  #pragma acc loop gang worker
  for(int x = 0; x < 4; x++){
    #pragma acc loop vector
    for(int y = 0; y < 32; y++){
      array[x][y]++;
    }
  }
}

COLLAPSE CLAUSE WITH VECTOR LENGTH

#pragma acc data copy(A[:n*m]) copyin(Anew[:n*m])
while ( err > tol && iter < iter_max ) {
  err = 0.0;
  #pragma acc parallel loop reduction(max:err) collapse(2) vector_length(1024) \
    copyin(A[0:n*m]) copy(Anew[0:n*m])
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }
  #pragma acc parallel loop collapse(2) vector_length(1024) \
    copyin(Anew[0:n*m]) copyout(A[0:n*m])
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }
  iter++;
}

OPENACC SPEED-UP
Speed-up relative to the optimized-data version:
NVIDIA TESLA V100 (DATA): 1.00X
NVIDIA TESLA V100 (COLLAPSE): 1.03X
NVIDIA TESLA V100 (TILE): 1.13X
NVIDIA TESLA V100 (COLLAPSE/VECTOR): 1.12X
PGI 18.7, NVIDIA Tesla V100, Intel i9-7900X CPU @ 3.30GHz

LOOP OPTIMIZATION RULES OF THUMB
It is rarely a good idea to set the number of gangs in your code; let the compiler decide.
Most of the time you can effectively tune a loop nest by adjusting only the vector length.
It is rare to use a worker loop. When the vector length is very short, a worker loop can increase the parallelism in your gang.
When possible, the vector loop should step through your arrays contiguously.
Gangs should come from outer loops, vectors from inner loops.
A short sketch applying these rules follows below.
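As a hedged sketch of these rules (not from the slides), a typically tuned loop nest puts gang on the outer loop, vector on the contiguous inner loop, and adjusts only the vector length; the arrays A and B and the value 128 are illustrative starting points to tune from:

#pragma acc parallel loop gang vector_length(128)   // gangs from the outer loop; tune only the vector length
for (int j = 0; j < n; j++) {
  #pragma acc loop vector                           // vector loop steps stride-1 through the row
  for (int i = 0; i < m; i++)
    B[j][i] = 2.0 * A[j][i];
}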

CLOSING REMARKS

KEY CONCEPTS
In this lab we discussed:
Some details that are available to use from a GPU profile
Gangs, Workers, and Vectors Demystified
The collapse clause
The tile clause
The gang, worker, and vector clauses

OPENACC RESOURCES
Guides, Talks, Tutorials, Videos, Books, Spec, Code Samples, Teaching Materials, Events, Success Stories, Courses, Slack, Stack Overflow
Resources: https://www.openacc.org/resources
Success Stories: https://www.openacc.org/success-stories
FREE Compilers and Tools: https://www.openacc.org/tools
Events: https://www.openacc.org/events