Portability of OpenMP Offload Directives
Jeff Larkin, OpenMP Booth Talk, SC17

Background

Many developers choose OpenMP in hopes of having a single source code that runs effectively anywhere (performance portable). As of November 2017, do OpenMP compilers deliver on performance, portability, and performance portability?

Will OpenMP Target code be portable between compilers?
Will OpenMP Target code be portable with the host?

I will compare results using 6 compilers: CLANG, Cray, GCC, Intel, PGI, and XL.

Goal of this study

1. For the GPU-enabled compilers, compare performance on a simple benchmark code.
   Metric = execution time on the GPU.
2. Quantify each compiler's ability to performantly fall back to the host CPU.
   In other words, if I write offloading code, will it still perform well on the host?
   Metric = time of native OpenMP / time of host fallback.
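To make the second metric concrete, here is a minimal timing sketch (not from the talk); run_native() and run_fallback() are hypothetical stand-ins for the two builds of the benchmark, and omp_get_wtime() is the standard OpenMP timer:

    #include <omp.h>
    #include <stdio.h>

    /* Hypothetical stand-ins for the two builds of the benchmark:
       one compiled as plain host OpenMP, one compiled with target
       directives but forced to run on the host. */
    extern void run_native(void);
    extern void run_fallback(void);

    int main(void)
    {
        double t0 = omp_get_wtime();
        run_native();
        double t_native = omp_get_wtime() - t0;

        t0 = omp_get_wtime();
        run_fallback();
        double t_fallback = omp_get_wtime() - t0;

        /* 100% means the host fallback matches native host OpenMP. */
        printf("host fallback efficiency: %.1f%%\n",
               100.0 * t_native / t_fallback);
        return 0;
    }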

Compiler Versions & Flags

CLANG (IBM Power8 + NVIDIA Tesla P100)
  clang/20170629
  -O2 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda --cuda-path=$CUDA_HOME

Cray (Cray XC50)
  8.5.5

GCC (IBM Power8 + NVIDIA Tesla P100)
  7.1.1 20170718 (experimental)
  -O3 -fopenmp -foffload="-lm"

Compiler Versions & Flags

Intel (Intel Haswell)
  17.0
  -Ofast -qopenmp -std=c99 -qopenmp-offload=host

XL (IBM Power8 + NVIDIA Tesla P100)
  xl/20170727-beta
  -O3 -qsmp -qoffload

PGI (Intel CPU)
  17.10 (llvm)
  -fast -mp -Minfo -Mllvm

Case Study: Jacobi Iteration

Example: Jacobi Iteration

A common, useful algorithm. It iteratively converges to the correct value (e.g. temperature) by computing new values at each point from the average of the neighboring points. Example: solve the Laplace equation in 2D, ∇²f(x, y) = 0.

[Figure: 5-point stencil showing A(i,j) and its neighbors A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1).]

A_{k+1}(i, j) = ( A_k(i-1, j) + A_k(i+1, j) + A_k(i, j-1) + A_k(i, j+1) ) / 4
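For reference, a minimal serial sketch of this update in C (not from the slides; the array names follow the offload examples later in the talk, the boundary points are left fixed, and the outer convergence loop is omitted):

    #include <math.h>

    /* One serial Jacobi sweep over the interior of an n x m grid.
       Returns the maximum change, which the caller can test against
       a convergence tolerance. */
    double jacobi_sweep(int n, int m, double A[n][m], double Anew[n][m])
    {
        double error = 0.0;
        for (int j = 1; j < n-1; j++)
            for (int i = 1; i < m-1; i++) {
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i] );
                error = fmax(error, fabs(Anew[j][i] - A[j][i]));
            }
        return error;
    }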

Teams & Distribute

OpenMP Teams: the TEAMS Directive

To better utilize the GPU resources, use many thread teams via the TEAMS directive.
Spawns one or more thread teams, each with the same number of threads.
Execution continues on the master thread of each team (redundantly).
No synchronization between teams.
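A tiny illustration of this behavior (not from the talk; it assumes a compiler and device that support printf inside target regions, as the GPU compilers discussed here generally do):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        /* The structured block is executed once per team, redundantly,
           by each team's master thread; teams do not synchronize. */
        #pragma omp target teams
        {
            printf("hello from team %d of %d\n",
                   omp_get_team_num(), omp_get_num_teams());
        }
        return 0;
    }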

OpenMP Teams: the DISTRIBUTE Directive

Distributes the iterations of the next loop to the master threads of the teams.
Iterations are distributed statically.
There are no guarantees about the order in which teams will execute, and no guarantee that all teams will execute simultaneously.
Does not generate parallelism/worksharing within the thread teams.
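A small sketch of DISTRIBUTE on its own (not from the talk): each team receives a static chunk of the loop, but without an inner PARALLEL FOR only the master thread of each team executes its chunk:

    /* Scale a vector on the device, distributing chunks of the loop
       across teams. Only the master thread of each team runs its chunk,
       because no PARALLEL FOR is nested inside. */
    void scale(int n, double *x, double s)
    {
        #pragma omp target teams distribute map(tofrom: x[0:n])
        for (int j = 0; j < n; j++)
            x[j] *= s;
    }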

Teaming Up

#pragma omp target data map(to:Anew) map(A)
while ( error > tol && iter < iter_max )
{
  error = 0.0;

  #pragma omp target teams distribute parallel for reduction(max:error) map(error)
  for( int j = 1; j < n-1; j++)
    for( int i = 1; i < m-1; i++ )
    {
      Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                          + A[j-1][i] + A[j+1][i]);
      error = fmax( error, fabs(Anew[j][i] - A[j][i]));
    }

  #pragma omp target teams distribute parallel for
  for( int j = 1; j < n-1; j++)
    for( int i = 1; i < m-1; i++ )
      A[j][i] = Anew[j][i];

  if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
  iter++;
}

The TARGET DATA region explicitly maps the arrays for the entire while loop. TARGET TEAMS DISTRIBUTE PARALLEL FOR spawns thread teams, distributes iterations to those teams, and workshares within those teams.

Execution Time (Smaller is Better)

[Bar chart, execution time in seconds: CLANG 7.786044, Cray 1.851838, GCC 42.543545, GCC simd 8.930509, XL 17.8542, XL simd 11.487634; each bar broken into Data, Kernels, and Other time. CLANG, GCC, XL: IBM Minsky, NVIDIA Tesla P100; Cray: Cray XC-50, NVIDIA Tesla P100.]

Execution Time (Smaller is Better)

[Bar chart, zoomed to 0-20 seconds: CLANG 7.50616, Cray 1.47895, GCC simd 4.109866, XL 17.402229, XL simd 11.046584; each bar broken into Data, Kernels, and Other time. CLANG, GCC, XL: IBM Minsky, NVIDIA Tesla P100; Cray: Cray XC-50, NVIDIA Tesla P100.]

Increasing Parallelism

Increasing Parallelism

Currently both our distributed and our workshared parallelism come from the same loop. We could collapse the two loops together, or we could move the PARALLEL FOR to the inner loop.

The COLLAPSE(N) clause turns the next N loops into one, linearized loop. This will give us more parallelism to distribute, if we so choose.
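As a rough sketch of what COLLAPSE(2) exposes (not from the talk; flat indexing is used here only to make the linearization explicit), collapsing the two copy loops is roughly equivalent to hand-linearizing them into a single loop over the (n-2) x (m-2) interior points:

    /* Hand-linearized equivalent of the collapse(2) copy loop:
       one linear iteration space over the interior points, which can
       then be split across both teams and threads. */
    void copy_interior(int n, int m, double *A, double *Anew)
    {
        #pragma omp target teams distribute parallel for \
                map(to: Anew[0:n*m]) map(tofrom: A[0:n*m])
        for (long idx = 0; idx < (long)(n-2) * (m-2); idx++) {
            int j = 1 + (int)(idx / (m-2));
            int i = 1 + (int)(idx % (m-2));
            A[j*m + i] = Anew[j*m + i];
        }
    }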

Collapse

#pragma omp target teams distribute parallel for reduction(max:error) map(error) \
        collapse(2)
for( int j = 1; j < n-1; j++)
  for( int i = 1; i < m-1; i++ )
  {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }

#pragma omp target teams distribute parallel for collapse(2)
for( int j = 1; j < n-1; j++)
  for( int i = 1; i < m-1; i++ )
    A[j][i] = Anew[j][i];

Collapse the two loops into one and then parallelize this new loop across both teams and threads.

Execution Time (Smaller is Better)

[Bar chart, execution time in seconds: CLANG 1.490654, Cray 1.820148, GCC 41.812337, XL 3.706288; each bar broken into Data, Kernels, and Other time. CLANG, GCC, XL: IBM Minsky, NVIDIA Tesla P100; Cray: Cray XC-50, NVIDIA Tesla P100.]

Execution Time (Smaller is Better)

[Bar chart: zoomed view (0 to 4 seconds) of the preceding results for CLANG, Cray, and XL; each bar broken into Data, Kernels, and Other time. CLANG, GCC, XL: IBM Minsky, NVIDIA Tesla P100; Cray: Cray XC-50, NVIDIA Tesla P100.]

Splitting Teams & Parallel

#pragma omp target teams distribute map(error)
for( int j = 1; j < n-1; j++)
{
  #pragma omp parallel for reduction(max:error)
  for( int i = 1; i < m-1; i++ )
  {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }
}

#pragma omp target teams distribute
for( int j = 1; j < n-1; j++)
{
  #pragma omp parallel for
  for( int i = 1; i < m-1; i++ )
    A[j][i] = Anew[j][i];
}

Distribute the j loop over teams. Workshare the i loop over threads.

Execution Time (Smaller is Better)

[Bar chart, execution time in seconds: CLANG 2.30662, Cray 1.94593, GCC 49.474303, GCC simd 12.261814, XL 14.559997; each bar broken into Data, Kernels, and Other time. CLANG, GCC, XL: IBM Minsky, NVIDIA Tesla P100; Cray: Cray XC-40, NVIDIA Tesla P100.]

Execution Time (Smaller is Better)

[Bar chart, zoomed to 0-16 seconds: CLANG 2.033891, Cray 1.47519, GCC simd 7.375151, XL 14.10688; each bar broken into Data, Kernels, and Other time. CLANG, GCC, XL: IBM Minsky, NVIDIA Tesla P100; Cray: Cray XC-50, NVIDIA Tesla P100.]

Host Fallback

Fallback to the Host Processor

Most OpenMP users would like to write one set of directives for both host and device, but is this really possible? Using the IF clause, offloading can be enabled or disabled at runtime.

#pragma omp target teams distribute parallel for reduction(max:error) map(error) \
        collapse(2) if(target:use_gpu)
for( int j = 1; j < n-1; j++)
  for( int i = 1; i < m-1; i++ )
  {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }

The compiler must build both CPU and GPU code and select between them at runtime.
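One possible way to set such a use_gpu flag at runtime (a sketch, not from the talk; the USE_GPU environment variable name is illustrative), using the standard OpenMP device query:

    #include <omp.h>
    #include <stdlib.h>

    /* Default to offloading when a device is available; allow an
       illustrative USE_GPU=0/1 environment variable to override. */
    static int select_use_gpu(void)
    {
        const char *env = getenv("USE_GPU");
        if (env != NULL)
            return atoi(env) != 0;
        return omp_get_num_devices() > 0;
    }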

Host Fallback Comparison

Native OpenMP = standard OMP PARALLEL FOR (SIMD).
Host Fallback = device OpenMP directives, forced to run on the host.
Metric: native time / host fallback time. In other words, 100% means the two perform equally well, and 50% means the host fallback takes 2X longer.

Host Fallback vs. Host Native OpenMP

[Bar chart: host fallback performance as a percentage of reference CPU threading (y-axis 0-120%), for the Teams, Collapse, and Split versions under each of CLANG, Cray, GCC, XL, Intel, and PGI. CLANG, GCC, XL: IBM Minsky, NVIDIA Tesla P100; Cray: Cray XC-50, NVIDIA Tesla P100; Intel, PGI: Intel Haswell.]

Conclusions

Conclusions

Will OpenMP Target code be portable between compilers? Maybe. Compilers are at various levels of maturity, and SIMD support/requirements are inconsistent.

Will OpenMP Target code be portable with the host? Highly compiler-dependent. Intel, PGI, and XL do this very well, CLANG somewhat well, and GCC and Cray did poorly.

Future work: revisit these experiments as compilers are updated.