
OpenACC Pipelining Jeff Larkin <jlarkin@nvidia.com>, November 18, 2016

Asynchronous Programming
Programming multiple operations without immediate synchronization.
Real-world examples:
Cooking a meal: boiling potatoes while preparing other parts of the dish.
A group project: three students working on a report on George Washington, one researching his early life, another his military career, and the third his presidency.
An automobile assembly line: each station adds a different part to the car until it is finally assembled.

Asynchronous OpenACC
So far, all OpenACC directives have been synchronous with the host:
The host waits for the parallel loop to complete.
The host waits for data updates to complete.
Most OpenACC directives can be made asynchronous:
The host issues multiple parallel loops to the device before waiting.
The host performs part of the calculation while the device is busy.
Data transfers can happen before the data is needed.
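As a minimal sketch of the idea (not from the slides; the array a, its length n, and do_host_work() are illustrative placeholders), a kernel launched into an async queue lets the host keep working until an explicit wait:

    #pragma acc parallel loop async(1)   /* launch into queue 1 and return immediately */
    for (int i = 0; i < n; i++)
        a[i] = 2.0f * a[i];

    do_host_work();                      /* hypothetical host-side work, overlaps the kernel */

    #pragma acc wait(1)                  /* block until everything in queue 1 has finished */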

Asynchronous Pipelining
Very large operations can frequently be broken into smaller parts that can be performed independently.
Pipeline stage: a single step, which is frequently limited to processing one part at a time.
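As a generic sketch of the pattern (assumed structure, not the case-study code; copy_in, compute, and copy_out are hypothetical helpers), each block of work passes through the same stages, and with asynchronous execution different blocks can occupy different stages at the same time:

    for (int b = 0; b < nblocks; b++) {
        copy_in(b);    /* stage 1: transfer block b to the device */
        compute(b);    /* stage 2: run the kernel on block b      */
        copy_out(b);   /* stage 3: transfer block b back to host  */
    }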

Case Study: Image Filter
The example code reads an image from a file, applies a simple filter to the image, then writes the result to an output file.
Skills used:
Parallelizing the filter on the GPU
Data regions and the update directive
Async and wait directives
Pipelining

Image Filter Code

    #pragma acc parallel loop collapse(2) gang vector
    for ( y = 0; y < h; y++ ) {                        /* iterate over all pixels */
        for ( x = 0; x < w; x++ ) {
            float blue = 0.0, green = 0.0, red = 0.0;
            for ( int fy = 0; fy < filtersize; fy++ ) {    /* apply filter */
                long iy = y - (filtersize/2) + fy;
                for ( int fx = 0; fx < filtersize; fx++ ) {
                    long ix = x - (filtersize/2) + fx;
                    blue  += filter[fy][fx] * imgdata[iy*step + ix*ch];
                    green += filter[fy][fx] * imgdata[iy*step + ix*ch + 1];
                    red   += filter[fy][fx] * imgdata[iy*step + ix*ch + 2];
                }
            }
            out[y*step + x*ch]     = 255 - (scale * blue);  /* save output */
            out[y*step + x*ch + 1] = 255 - (scale * green);
            out[y*step + x*ch + 2] = 255 - (scale * red);
        }
    }

GPU Timeline
Roughly 1/3 of the runtime is occupied with data transfers. What if we could overlap the copies with the computation?
Rough math: 3.2 ms total - 0.5 ms copy-in - 0.5 ms copy-out = 2.2 ms of compute, so if the copies were fully hidden the runtime would drop from 3.2 ms to about 2.2 ms, a 3.2 / 2.2 ~= 1.45X speed-up.

Pipelining Data Transfers
Serialized: the H2D copy, kernel, and D2H copy of each block run back to back (H2D, kernel, D2H; H2D, kernel, D2H; ...), so the two independent operations, copying and computing, never overlap.
Pipelined: while one block's kernel runs, the next block's H2D copy and the previous block's D2H copy are in flight, overlapping copying and computation.
NOTE: In real applications, your boxes will not be so evenly sized.

Blocking the Code
Before we can overlap data transfers with computation, we need to break our work into smaller chunks.
Since each pixel is calculated independently, the work can easily be divided. We'll divide it into chunks of rows for convenience, as sketched below.
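The slides use nblocks and blocksize without showing how they are set; a plausible way to derive them (an assumption, not taken from the original code) is:

    long nblocks   = 8;                              /* number of row blocks (assumed; the later timeline shows 8) */
    long blocksize = (h + nblocks - 1) / nblocks;    /* rows per block, rounded up to cover the whole image */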

Blocked Image Filter Code
1. Add a blocking loop.
2. Adjust y to iterate only over the rows within a single block.
3. Use a data region to handle the copies.

    #pragma acc data copyin(imgdata[:w*h*ch], filter) copyout(out[:w*h*ch])
    for ( long blocky = 0; blocky < nblocks; blocky++ ) {
        long starty = blocky * blocksize;
        long endy   = starty + blocksize;
        #pragma acc parallel loop collapse(2) gang vector
        for ( y = starty; y < endy; y++ ) {
            for ( x = 0; x < w; x++ ) {
                float blue = 0.0, green = 0.0, red = 0.0;
                for ( int fy = 0; fy < filtersize; fy++ ) {
                    long iy = y - (filtersize/2) + fy;
                    for ( int fx = 0; fx < filtersize; fx++ ) {
                        long ix = x - (filtersize/2) + fx;
                        blue  += filter[fy][fx] * imgdata[iy*step + ix*ch];
                        green += filter[fy][fx] * imgdata[iy*step + ix*ch + 1];
                        red   += filter[fy][fx] * imgdata[iy*step + ix*ch + 2];
                    }
                }
                out[y*step + x*ch]     = 255 - (scale * blue);
                out[y*step + x*ch + 1] = 255 - (scale * green);
                out[y*step + x*ch + 2] = 255 - (scale * red);
            }
        }
    }

GPU Timeline: Blocked
The compute kernel is now broken into 8 blocks. Data transfers still happen at the beginning and end.

OpenACC Update Directive
The programmer specifies an array (or part of an array) that should be refreshed within a data region; the host and device copies are made coherent.

    do_something_on_device()
    !$acc update host(a)     ! copy a from GPU to CPU
    do_something_on_host()
    !$acc update device(a)   ! copy a from CPU to GPU

Note: update host has been deprecated and renamed self.
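Since the rest of the case study is in C, the same idea as a minimal C sketch (the array a and its length n are illustrative) would look like:

    do_something_on_device();
    #pragma acc update self(a[0:n])    /* copy a from GPU to CPU (self is the non-deprecated spelling of host) */
    do_something_on_host();
    #pragma acc update device(a[0:n])  /* copy a from CPU to GPU */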

Blocked Update Code
1. Change the data clauses to create.
2. Update the input data one block at a time.
3. Copy the results back one block at a time.

    #pragma acc data create(imgdata[:w*h*ch], out[:w*h*ch]) copyin(filter)
    for ( long blocky = 0; blocky < nblocks; blocky++ ) {
        /* For the input copy we include the ghost rows needed by the filter. */
        long starty = MAX(0, blocky * blocksize - filtersize/2);
        long endy   = MIN(h, starty + blocksize + filtersize/2);
        #pragma acc update device(imgdata[starty*step:(endy-starty)*step])
        starty = blocky * blocksize;
        endy   = starty + blocksize;
        #pragma acc parallel loop collapse(2) gang vector
        for ( y = starty; y < endy; y++ ) {
            for ( x = 0; x < w; x++ ) {
                <filter code omitted>
                out[y*step + x*ch]     = 255 - (scale * blue);
                out[y*step + x*ch + 1] = 255 - (scale * green);
                out[y*step + x*ch + 2] = 255 - (scale * red);
            }
        }
        #pragma acc update self(out[starty*step:blocksize*step])
    }

GPU Timeline: Blocked Updates
Compute and updates now happen in blocks. The last step is to overlap compute and copy.

OpenACC async and wait
async(n): launches work asynchronously in queue n.
wait(n): blocks the host until all operations in queue n have completed.
Work queues operate in order, serving as a way to express dependencies. Work queues with different numbers may (or may not) run concurrently.

    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; i++)
        ...
    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; i++)
        ...
    #pragma acc wait(1)

If n is not specified, async work goes into a default queue and wait waits on all previously queued work.
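As an illustration of the queue semantics (a sketch; the arrays a and b and their length n are assumptions), work placed in the same queue runs in order, so the second loop below is guaranteed to run after the transfer it depends on, while the loop in queue 2 is independent and may overlap with queue 1:

    #pragma acc update device(a[0:n]) async(1)   /* transfer a in queue 1                          */
    #pragma acc parallel loop async(2)           /* independent work in queue 2                    */
    for (int i = 0; i < n; i++)
        b[i] = b[i] * b[i];
    #pragma acc parallel loop async(1)           /* same queue as the transfer, so it runs after it */
    for (int i = 0; i < n; i++)
        a[i] = a[i] + 1.0f;
    #pragma acc wait                             /* wait on all queues                             */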

Pipelined Code
Cycle between three async queues by block, then wait for all blocks to complete.

    #pragma acc data create(imgdata[:w*h*ch], out[:w*h*ch]) copyin(filter)
    {
        for ( long blocky = 0; blocky < nblocks; blocky++ ) {
            long starty = MAX(0, blocky * blocksize - filtersize/2);
            long endy   = MIN(h, starty + blocksize + filtersize/2);
            #pragma acc update device(imgdata[starty*step:(endy-starty)*step]) async(blocky%3+1)
            starty = blocky * blocksize;
            endy   = starty + blocksize;
            #pragma acc parallel loop collapse(2) gang vector async(blocky%3+1)
            for ( y = starty; y < endy; y++ ) {
                for ( x = 0; x < w; x++ ) {
                    <filter code omitted>
                    out[y*step + x*ch]     = 255 - (scale * blue);
                    out[y*step + x*ch + 1] = 255 - (scale * green);
                    out[y*step + x*ch + 2] = 255 - (scale * red);
                }
            }
            #pragma acc update self(out[starty*step:blocksize*step]) async(blocky%3+1)
        }
        #pragma acc wait
    }

GPU Timeline: Pipelined
We're now able to overlap compute and copy.

Step-by-Step Performance (speed-up over the original version)
Original: 1.00X
Blocked: 0.93X
Update: 0.60X
Pipelined: 1.45X
Source: PGI 16.9, NVIDIA Tesla K20c

Multi-GPU Pipelining

Multi-GPU OpenACC with OpenMP

    #pragma omp parallel num_threads(acc_get_num_devices(acc_device_nvidia))
    {
        int myid = omp_get_thread_num();
        /* Set the device number; all work from this thread will be sent to this device. */
        acc_set_device_num(myid, acc_device_nvidia);
        int queue = 1;
        #pragma acc data create(imgdata[:w*h*ch], out[:w*h*ch])
        {
            #pragma omp for schedule(static)
            for ( long blocky = 0; blocky < nblocks; blocky++ ) {
                /* For data copies we need to include the ghost zones for the filter. */
                long starty = MAX(0, blocky * blocksize - filtersize/2);
                long endy   = MIN(h, starty + blocksize + filtersize/2);
                #pragma acc update device(imgdata[starty*step:(endy-starty)*step]) async(queue)
                starty = blocky * blocksize;
                endy   = starty + blocksize;
                #pragma acc parallel loop collapse(2) gang vector async(queue)
                for ( long y = starty; y < endy; y++ ) {
                    for ( long x = 0; x < w; x++ ) {
                        <filter code>
                    }
                }
                #pragma acc update self(out[starty*step:blocksize*step]) async(queue)
                /* Use multiple queues per device to get copy/compute overlap. */
                queue = (queue%3) + 1;
            }
            /* Wait for all work to complete (per device). */
            #pragma acc wait
        }
    }
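For reference, a build line in the spirit of the PGI toolchain used in these slides might look like the following (a sketch; the source file name is hypothetical and exact sub-options vary by compiler version):

    pgcc -acc -ta=tesla -mp -Minfo=accel filter.c -o filter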

Multi-GPU Pipeline Profile

Step-by-Step Performance (speed-up over a single GPU)
1 GPU: 1.00X
2 GPUs: 1.93X
3 GPUs: 2.71X
4 GPUs: 3.28X
Source: PGI 16.10, NVIDIA Tesla K40

Where to find OpenACC help
OpenACC Course Recordings - https://developer.nvidia.com/openacc-courses
PGI Website - http://www.pgroup.com/resources
OpenACC on StackOverflow - http://stackoverflow.com/questions/tagged/openacc
OpenACC Toolkit - http://developer.nvidia.com/openacc-toolkit
Parallel Forall Blog - http://devblogs.nvidia.com/parallelforall/
GPU Technology Conference - http://www.gputechconf.com/
OpenACC Website - http://openacc.org/

Free Qwiklabs
1. Create an account with NVIDIA Qwiklabs: https://developer.nvidia.com/qwiklabssignup
2. Enter the promo code JEFF_LARKIN before submitting the form.
3. Free credits will be added to your account.
4. Start taking labs!

Free Qwiklab Quests
Email me if you run out of credits; I can always get you more!

Looking for a summer internship?
Go to www.nvidia.com and look for Careers at the bottom of the page.