Comparing OpenACC 2.5 and OpenMP 4.1 James C Beyer PhD, Sept 29 th 2015

Abstract: As both an OpenMP and an OpenACC insider, I will present my opinion of the current status of these two directive sets for programming accelerators. Insights into why some decisions were made in initial releases, and why they were changed later, will be used to explain the trade-offs required to reach agreement on these complex directive sets.

Why a talk about OpenMP vs. OpenACC?
- These are the two dominant directive-based methods for programming accelerators
- They are relatives!
- Why not? It seems like a nice, non-controversial topic

Agenda
- OpenMP status
- OpenACC status
- Major differences
- Loops

Status: OpenMP 4.1
New or changed functionality:
- use-device-ptr, is-device-ptr
- nowait, depend
- (first)private
- defaultmap
- declare target link
- target enter/exit data
- scalar implicit data-sharing: firstprivate
- deeper shallow copy
- device memory routines
Existing OpenACC 2.0 functionality:
- host_data, deviceptr
- async, async()/wait()
- (first)private
- declare link
- enter/exit data
- scalar implicit data-sharing
- device memory routines

Status: OpenACC 2.5
New or changed functionality:
- declare create on allocatables
- init, shutdown, set directives
- if_present on update
- routine bind
- openacc_lib.h deprecated
- API to get/set default async queue
- switched default present_or behavior; default(present)
- async versions of memory APIs; acc_memcpy_device
- profiling support
OpenMP 4.1 functionality:
- update is a no-op if not present
- omp_target_memcpy[_rect]
- TR

Major differences
What is the fundamental difference between OpenMP and OpenACC?
1) Prescriptive vs. descriptive (or is it?)
2) Synchronization
3) Independent loop iterations (this is an interesting and important topic)
4) Expressiveness vs. performance and portability (is this real or just perceived?)

Scope: The OpenMP API covers only user-directed parallelization, wherein the programmer explicitly specifies the actions to be taken by the compiler and runtime system in order to execute the program in parallel. OpenMP-compliant implementations are not required to check for data dependencies, data conflicts, race conditions, or deadlocks, any of which may occur in conforming programs. In addition, compliant implementations are not required to check for code sequences that cause a program to be classified as non-conforming. Application developers are responsible for correctly using the OpenMP API to produce a conforming program. The OpenMP API does not cover compiler-generated automatic parallelization and directives to the compiler to assist such parallelization. (OpenMP 4.1 Comment Draft, page 1)

Introduction: The programming model allows the programmer to augment information available to the compilers, including specification of data local to an accelerator, guidance on mapping of loops onto an accelerator, and similar performance-related details.
Scope: This OpenACC API document covers only user-directed accelerator programming, where the user specifies the regions of a host program to be targeted for offloading to an accelerator device. The remainder of the program will be executed on the host. This document does not describe features or limitations of the host programming environment as a whole; it is limited to specification of loops and regions of code to be offloaded to an accelerator. (OpenACC 2.5 Comment Draft, page 7)

So what's different between those two statements?
- "wherein the programmer explicitly specifies the actions to be taken by the compiler" -- OpenMP: EXPLICIT
- "guidance on mapping of loops"; "user specifies the regions of a host program to be targeted for offloading" -- OpenACC: IMPLICIT and EXPLICIT
So is this difference significant? Clearly we thought it was when we formed the OpenACC technical committee.

Major differences

Synchronization
OpenMP:
- master
- critical
- barrier
- taskwait, taskgroup
- atomic (restrictions?)
- flush
- ordered, depend
OpenACC:
- atomic, with restrictions
- async/wait, if you squint

Present or not
OpenMP learns from OpenACC, and vice versa:
- OpenACC 1.0 had two forms of data motion clauses: one form for testing presence and one for skipping the presence test.
- OpenMP 4.0 had one form of data motion clause, which always checks for presence.
- OpenACC 2.5 eliminates the form that skips the presence test; actually, it makes the fast path do a present test.

Present or not (take 2): update directives
- OpenACC 1.0 and OpenMP 4.0 both made it an error to do an update on an object that was not present on the device.
- OpenACC 2.0 and OpenMP 4.1 both relaxed this hindrance to programmers:
  - OpenACC 2.0 added the if_present clause.
  - OpenMP 4.1 just makes the update a no-op if the object is not present.

Routines and/or calls
- OpenACC 1.0 did not support calls: either they had to flatten out or they could not be present, with a few minor exceptions provided by implementations.
- OpenMP 4.0 provided declare target, which places objects and routines on the device. The OpenMP compiler does not look for automatic parallelism, thus routines do not need to reserve parallelism for themselves.
- OpenACC 2.0 provided routine, which causes the compiler to generate one type of subroutine for the device: gang, worker, vector, or sequential.

Default firstprivate scalars
- OpenACC 1.0 made all scalars firstprivate unless overridden by the programmer.
- OpenMP 4.0 made all scalars map(tofrom) on target constructs. What happens if the object is already on the device? Either the system goes to the device to determine important kernel-sizing details, or it has to make safe assumptions.
User: Why is my code slow?  Implementer: Because the kernel is mis-sized.
User: How do I fix it?  Implementer: You have to rewrite your code to
User: That's ridiculous!  Implementer: Agreed.
- OpenMP 4.1 makes all scalars firstprivate by default.

Biggest mistake made in both directive sets
- OpenACC 1.0 and OpenMP 4.0 both contained structured data constructs.
- C++ has constructors and destructors (surprise!). Structured data constructs do not mix well in this environment.
- C and Fortran developers create their own initializers (constructors) and finalizers (destructors).
- Both OpenACC and OpenMP learned from this mistake.

Host or device compilation
OpenMP 4.1:
- Host: omp parallel for/do
- Device: omp target parallel for/do, or
  omp target teams distribute parallel for, or
  omp target teams distribute parallel for simd, or
  omp target teams distribute simd, or
  omp target parallel for simd, or
  omp target simd
OpenACC 2.5:
- Host: acc parallel loop
- Device: acc parallel loop*

OpenACC loops vs. OpenMP loops

Independent loop example(s) in C

void foo( int *a, int n ) {
    for ( int i = 0; i < n; i++ ) {
        a[i] = i;
    }
}

void foo2( int *a, int *b, int n ) {
    for ( int i = 0; i < n; i++ ) {
        a[i] = a[i] * b[i];
    }
}

Independent loop example(s) in C, with restrict

void foo( int *a, int n ) {
    for ( int i = 0; i < n; i++ ) {
        a[i] = i;
    }
}

void foo2( int * restrict a, int * restrict b, int n ) {
    for ( int i = 0; i < n; i++ ) {
        a[i] = a[i] * b[i];
    }
}

Independent loop example(s) in Fortran

subroutine foo( a, n )
  integer :: a(n)
  integer :: n, i
  do i = 1, n
    a(i) = i
  enddo
end subroutine foo

subroutine foo2( a, b, n )
  integer :: a(n), b(n)
  integer :: n, i
  do i = 1, n
    a(i) = a(i) * b(i)
  enddo
end subroutine foo2

Parallelizing independent loops using OpenMP and OpenACC

void foo2( int *a, int *b, int n ) {
    #pragma omp parallel for
    for ( int i = 0; i < n; i++ ) {
        a[i] = a[i] * b[i];
    }
}

void foo2( int *a, int *b, int n ) {
    #pragma acc parallel loop
    for ( int i = 0; i < n; i++ ) {
        a[i] = a[i] * b[i];
    }
}

Parallelizing independent loops, OpenMP and OpenACC (the other way)

void foo2( int *a, int *b, int n ) {
    #pragma omp parallel for
    for ( int i = 0; i < n; i++ ) {
        a[i] = a[i] * b[i];
    }
}

void foo2( int * restrict a, int * restrict b, int n ) {
    #pragma acc kernels
    for ( int i = 0; i < n; i++ ) {
        a[i] = a[i] * b[i];
    }
}

Parallelizing independent loops: nearest equivalents to the OpenACC versions

void foo2( int *a, int *b, int n ) {
    #pragma omp target teams distribute [parallel for simd]
    for ( int i = 0; i < n; i++ ) {
        a[i] = a[i] * b[i];
    }
}

Parallelizing independent loops: nearest equivalents to the OpenACC versions

void foo2( int *a, int *b, int n ) {
    #pragma omp target teams distribute [simd]
    for ( int i = 0; i < n; i++ ) {
        a[i] = a[i] * b[i];
    }
}

Parallelizing a dependent loop example, OpenMP and OpenACC

void foo2( int *a, int *b, int n ) {
    #pragma omp parallel for ordered
    for ( int i = 0; i < n; i++ ) {
        #pragma omp ordered
        a[i] = a[i-1] * b[i];
    }
}

void foo2( int *a, int *b, int n ) {
    #pragma acc kernels
    for ( int i = 0; i < n; i++ ) {
        a[i] = a[i-1] * b[i];
    }
}

1-D stencil loop examined: what is the best solution?

!$acc parallel loop
do k = 2, n3-1
  !$acc loop worker
  do j = 2, n2-1
    do i = 2, n1-1
      !$acc cache( u(i-1:i+1, j, k) )
      r(i,j,k) = <full stencil on u>
    enddo
  enddo
enddo

!$omp target teams distribute
do k = 2, n3-1
  !$omp parallel do [simd?]
  do j = 2, n2-1
    do i = 2, n1-1
      r(i,j,k) = <full stencil on u>
    enddo
  enddo
enddo

1-D stencil loop examined

!$omp target teams distribute parallel do [simd] collapse(3)
do k = 2, n3-1
  do j = 2, n2-1
    do i = 2, n1-1
      r(i,j,k) = <full stencil on u>
    enddo
  enddo
enddo

Sparse matrix-vector multiply

for (int i = 0; i < num_rows; i++) {
    double sum = 0;
    int row_start = row_offsets[i];
    int row_end = row_offsets[i+1];
    for (int j = row_start; j < row_end; j++) {
        unsigned int Acol = cols[j];
        double Acoef = Acoefs[j];
        double xcoef = xcoefs[Acol];
        sum += Acoef * xcoef;
    }
    ycoefs[i] = sum;
}

Sparse matrix-vector multiply, OpenMP: 1

#pragma omp parallel for
for (int i = 0; i < num_rows; i++) {
    double sum = 0;
    int row_start = row_offsets[i];
    int row_end = row_offsets[i+1];
    for (int j = row_start; j < row_end; j++) {
        unsigned int Acol = cols[j];
        double Acoef = Acoefs[j];
        double xcoef = xcoefs[Acol];
        sum += Acoef * xcoef;
    }
    ycoefs[i] = sum;
}

Sparse matrix-vector multiply, OpenMP: 2 (when autovectorization fails)

#pragma omp parallel for [simd]
for (int i = 0; i < num_rows; i++) {
    double sum = 0;
    int row_start = row_offsets[i];
    int row_end = row_offsets[i+1];
    [#pragma omp simd]
    for (int j = row_start; j < row_end; j++) {
        unsigned int Acol = cols[j];
        double Acoef = Acoefs[j];
        double xcoef = xcoefs[Acol];
        sum += Acoef * xcoef;
    }
    ycoefs[i] = sum;
}

Which is better? The answer depends on hardware and inputs.

Sparse matrix-vector multiply, OpenMP: 3 (on an accelerator)

#pragma omp target teams distribute
for (int i = 0; i < num_rows; i++) {
    double sum = 0;
    int row_start = row_offsets[i];
    int row_end = row_offsets[i+1];
    for (int j = row_start; j < row_end; j++) {
        unsigned int Acol = cols[j];
        double Acoef = Acoefs[j];
        double xcoef = xcoefs[Acol];
        sum += Acoef * xcoef;
    }
    ycoefs[i] = sum;
}

Sparse matrix-vector multiply, OpenMP: 3.1 (what about the threads?)

#pragma omp target teams distribute [parallel for simd]
for (int i = 0; i < num_rows; i++) {
    double sum = 0;
    int row_start = row_offsets[i];
    int row_end = row_offsets[i+1];
    [#pragma omp [parallel for][simd]]
    for (int j = row_start; j < row_end; j++) {
        unsigned int Acol = cols[j];
        double Acoef = Acoefs[j];
        double xcoef = xcoefs[Acol];
        sum += Acoef * xcoef;
    }
    ycoefs[i] = sum;
}

Sparse matrix-vector multiply, OpenMP: 3.2 (what about running on the host?)

#pragma omp target teams distribute [parallel for simd] if( num_rows < min_rows )
for (int i = 0; i < num_rows; i++) {
    double sum = 0;
    int row_start = row_offsets[i];
    int row_end = row_offsets[i+1];
    [#pragma omp [parallel for][simd]]
    for (int j = row_start; j < row_end; j++) {
        unsigned int Acol = cols[j];
        double Acoef = Acoefs[j];
        double xcoef = xcoefs[Acol];
        sum += Acoef * xcoef;
    }
    ycoefs[i] = sum;
}

Do we really want a new team? Do we really get a new team? What if we want nested parallelism at different levels for different targets? Things just start looking strange. The language committee is working on fixing some of these issues.

Sparse matrix-vector multiply, OpenACC: 1

#pragma acc parallel loop
for (int i = 0; i < num_rows; i++) {
    double sum = 0;
    int row_start = row_offsets[i];
    int row_end = row_offsets[i+1];
    #pragma acc loop reduction(+:sum) // this is usually discovered by the compiler
    for (int j = row_start; j < row_end; j++) {
        unsigned int Acol = cols[j];
        double Acoef = Acoefs[j];
        double xcoef = xcoefs[Acol];
        sum += Acoef * xcoef;
    }
    ycoefs[i] = sum;
}

Sparse matrix-vector multiply, OpenACC: 2 (when autovectorization fails)

#pragma acc parallel loop [gang vector]
for (int i = 0; i < num_rows; i++) {
    double sum = 0;
    int row_start = row_offsets[i];
    int row_end = row_offsets[i+1];
    [#pragma acc loop vector]
    for (int j = row_start; j < row_end; j++) {
        unsigned int Acol = cols[j];
        double Acoef = Acoefs[j];
        double xcoef = xcoefs[Acol];
        sum += Acoef * xcoef;
    }
    ycoefs[i] = sum;
}

Which is better? The answer depends on hardware and inputs.

Sparse matrix-vector multiply, OpenACC: 2.1 (when autovectorization fails)

#pragma acc parallel loop gang device_type( CrayV ) vector
for (int i = 0; i < num_rows; i++) {
    double sum = 0;
    int row_start = row_offsets[i];
    int row_end = row_offsets[i+1];
    #pragma acc loop device_type(*) vector \
                     device_type( CrayV ) seq
    for (int j = row_start; j < row_end; j++) {
        unsigned int Acol = cols[j];
        double Acoef = Acoefs[j];
        double xcoef = xcoefs[Acol];
        sum += Acoef * xcoef;
    }
    ycoefs[i] = sum;
}

Sparse matrix-vector multiply, OpenACC: 2.2 (what about the host?)

#pragma acc parallel loop gang if( num_rows < min_rows )
for (int i = 0; i < num_rows; i++) {
    double sum = 0;
    int row_start = row_offsets[i];
    int row_end = row_offsets[i+1];
    #pragma acc loop device_type(*) vector
    for (int j = row_start; j < row_end; j++) {
        unsigned int Acol = cols[j];
        double Acoef = Acoefs[j];
        double xcoef = xcoefs[Acol];
        sum += Acoef * xcoef;
    }
    ycoefs[i] = sum;
}

What does the if clause do? It just says run the construct on the host. What does the vector clause do on the host? Not well defined; the spec says nothing about whether or not this affects the host.

"First as a member of the Cray technical staff and now as a member of the NVIDIA technical staff, I am working to ensure that OpenMP and OpenACC move towards parity whenever possible!" -- James Beyer, co-chair, OpenMP accelerator sub-committee and OpenACC technical committee

Common questions people ask about this topic
Q: Why did OpenACC split off from OpenMP? A: The final decision came down to time to release.
Q: When will the two specs merge? A: 4.1 shows significant effort went into merging the specs.
Q: Which is better? A: I cannot answer this question for you; it depends on what you are trying to do. I will say that OpenMP has more compilers that support it at this time.
Q: When (or will) <insert company name here> support OpenACC X or OpenMP 4.1? A: I have no idea.

SIMD versus vector
Why does OpenMP say SIMD while OpenACC says vector?
1) "SIMD" has been co-opted to mean one type of implementation.
2) "Vector" is a synonym for SIMD without implying an implementation.
Can vector loops contain dependencies within the vector length? Only if the hardware supports forwarding results to future elements; SIMT machines can do this relatively easily.
!$omp simd on a dependent loop may give interesting results; !$acc vector on a dependent loop should give the correct result.

Independent loop iterations
- Independent loop iterations can be run in any order.
- Dependent loops require some form of synchronization and potentially ordering guarantees.
- Question: can all algorithms be written to use independent loop iterations?
- Can NVIDIA GPUs handle dependent loop iterations? To a limited extent, yes.
- Can Intel Xeon Phi processors handle dependent loop iterations? Yes.
- Can processor X handle dependent loop iterations?