Directive-based Programming for Highly-scalable Nodes


Directive-based Programming for Highly-scalable Nodes. Doug Miles and Michael Wolfe, PGI Compilers & Tools, NVIDIA. Cray User Group Meeting, May 2016.

Talk Outline
- Increasingly Parallel Nodes
- Exposing Parallelism: the Role of Directives
- Expressing Parallelism in OpenMP and OpenACC
- Using High-bandwidth Memory
- Directives and Scalability

Node-level Parallelism (timeline): ASCI White (2001) had tens of FP64 lanes per node; Tianhe-2 and Titan (2013) and Trinity and Cori (2016) have hundreds of FP64 lanes per node; Summit and Sierra (2018) will have thousands and thousands of FP64 lanes per node.

The Latest HPC Processors (lanes of execution):

    Multicore CPU:         22 cores     88 FP64 lanes    176 FP32 lanes
    Many-core processor:   72 cores   1152 FP64 lanes   2304 FP32 lanes
    Tesla P100:            56 SMs     1792 FP64 lanes   3584 FP32 lanes
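(For scale, my arithmetic rather than the slide's: a P100 SM contains 64 FP32 and 32 FP64 cores, so 56 SMs give 56 × 32 = 1792 FP64 lanes and 56 × 64 = 3584 FP32 lanes; the CPU rows likewise multiply core count by per-core SIMD lanes, e.g. 72 × 16 = 1152 FP64 lanes.)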

Exposing parallelism in your code. Fill the Lanes!

Exposing parallelism: OpenACC KERNELS as a porting tool

    !$acc kernels
    do j = 1, m
      do i = 1, n
        a(j,i) = b(j,i)*alpha + c(i,j)*beta
      enddo
    enddo
    !$acc end kernels

Multicore:
    % pgf90 a.f90 -ta=multicore -c -Minfo
    sub:
        10, Loop is parallelizable
            Generating Multicore code
            10, !$acc loop gang
        11, Loop is parallelizable

GPU:
    % pgf90 a.f90 -ta=tesla -c -Minfo
    sub:
         9, Generating present(a(:,:),b(:,:),c(:,:))
        10, Loop is parallelizable
        11, Loop is parallelizable
            Accelerator kernel generated
            Generating Tesla code
            10, !$acc loop gang, vector(4)
            11, !$acc loop gang, vector(32)

OpenACC KERNELS as a construct

    % pgfortran -fast -acc -Minfo -c advec_mom_kernel.f90
    advec_mom_kernel:
        167, Loop is parallelizable
        169, Loop is parallelizable
             Accelerator kernel generated
             Generating Tesla code
             167, !$acc loop gang, vector(4)   ! blockidx%y threadidx%y
             169, !$acc loop gang, vector(32)  ! blockidx%x threadidx%x

     96 !$ACC KERNELS
    167 !$ACC LOOP INDEPENDENT
    168 DO k=y_min,y_max+1
    169 !$ACC LOOP INDEPENDENT PRIVATE(upwind,downwind,donor,dif,sigma,width,limiter,vdiffuw,vdiffdw,auw,adw,wind)
    170 DO j=x_min-1,x_max+1
    171   IF(node_flux(j,k).LT.0.0)THEN
    172     upwind=j+2
    173     donor=j+1
    174     downwind=j
    175     dif=donor
    176   ELSE
    177     upwind=j-1
    178     donor=j
    179     downwind=j+1
    180     dif=upwind
    181   ENDIF
    182   sigma=abs(node_flux(j,k))/(node_mass_pre(donor,k))
    183   width=celldx(j)
    184   vdiffuw=vel1(donor,k)-vel1(upwind,k)
    185   vdiffdw=vel1(downwind,k)-vel1(donor,k)
    186   limiter=0.0
    187   IF(vdiffuw*vdiffdw.GT.0.0)THEN
    188     auw=abs(vdiffuw)
    189     adw=abs(vdiffdw)
    190     wind=1.0_8
    191     IF(vdiffdw.LE.0.0) wind=-1.0_8
    192     limiter=wind*min(width*((2.0_8-sigma)*adw/width+(1.0_8+sigma)*auw/celldx(dif))/6.0_8,auw,adw)
    193   ENDIF
    194   advec_vel(j,k)=vel1(donor,k)+(1.0-sigma)*limiter
    195   mom_flux(j,k)=advec_vel(j,k)*node_flux(j,k)
    196 ENDDO
    197 ENDDO

Speedup vs a single core: the OpenACC version of CloverLeaf uses KERNELS exclusively.

[Bar chart, speedup over a single core (0x to 12x), comparing CPU: Intel OpenMP, CPU: PGI OpenACC, and CPU+GPU: PGI OpenACC for 359.miniGhost (Mantevo, MPI) and CloverLeaf (physics, MPI); a 30x annotation appears on one of the GPU bars.]

359.miniGhost system: Intel Xeon E5-2698 v3, 2 sockets, 32 cores total; GPU: Tesla K80 (single GPU).
CloverLeaf system: dual-socket Intel Xeon E5-2690 v2, 20 cores total; GPU: Tesla K80 (both GPUs).

... and as a path to standard languages: Fortran 2015 DO CONCURRENT
  + True parallel loops
  + Loop-scope shared/private data
  - No support for reductions
  - No support for data regions

    168 DO CONCURRENT (k=y_min:y_max+1)
    169 DO CONCURRENT (j=x_min-1:x_max+1) LOCAL(upwind,downwind,donor,dif,sigma,width,limiter,vdiffuw,vdiffdw,auw,adw,wind)
    171   IF(node_flux(j,k).LT.0.0)THEN
    172     upwind=j+2
    173     donor=j+1
    174     downwind=j
    175     dif=donor
    176   ELSE
    177     upwind=j-1
    178     donor=j
    179     downwind=j+1
    180     dif=upwind
    181   ENDIF
    182   sigma=abs(node_flux(j,k))/(node_mass_pre(donor,k))
    183   width=celldx(j)
    184   vdiffuw=vel1(donor,k)-vel1(upwind,k)
    185   vdiffdw=vel1(downwind,k)-vel1(donor,k)
    186   limiter=0.0
    187   IF(vdiffuw*vdiffdw.GT.0.0)THEN
    188     auw=abs(vdiffuw)
    189     adw=abs(vdiffdw)
    190     wind=1.0_8
    191     IF(vdiffdw.LE.0.0) wind=-1.0_8
    192     limiter=wind*min(width*((2.0_8-sigma)*adw/width+(1.0_8+sigma)*auw/celldx(dif))/6.0_8,auw,adw)
    193   ENDIF
    194   advec_vel(j,k)=vel1(donor,k)+(1.0-sigma)*limiter
    195   mom_flux(j,k)=advec_vel(j,k)*node_flux(j,k)
    196 ENDDO
    197 ENDDO

Expressing parallelism in OpenMP and OpenACC. Fill the Lanes!

Prescribing vs Describing*

OpenMP (prescriptive):

    while ( error > tol && iter < iter_max ) {
        #pragma omp parallel for reduction(max:error)
        for ( int j = 1; j < n-1; j++ ) {
            #pragma omp simd
            for ( int i = 1; i < m-1; i++ ) {
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i] );
                error = fmax( error, fabs(Anew[j][i] - A[j][i]) );
            }
        }
        #pragma omp parallel for
        for ( int j = 1; j < n-1; j++ ) {
            #pragma omp simd
            for ( int i = 1; i < m-1; i++ )
                A[j][i] = Anew[j][i];
        }
        if ( iter++ % 100 == 0 ) printf("%5d, %0.6f\n", iter, error);
    }

OpenACC (descriptive):

    while ( error > tol && iter < iter_max ) {
        #pragma acc parallel loop reduction(max:error)
        for ( int j = 1; j < n-1; j++ ) {
            #pragma acc loop reduction(max:error)
            for ( int i = 1; i < m-1; i++ ) {
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i] );
                error = fmax( error, fabs(Anew[j][i] - A[j][i]) );
            }
        }
        #pragma acc parallel loop
        for ( int j = 1; j < n-1; j++ ) {
            #pragma acc loop
            for ( int i = 1; i < m-1; i++ )
                A[j][i] = Anew[j][i];
        }
        if ( iter++ % 100 == 0 ) printf("%5d, %0.6f\n", iter, error);
    }

* Performance Portability Through Descriptive Parallelism, Jeff Larkin, NVIDIA, DOE Centers of Excellence Performance Portability Meeting, April 19-21, 2016

Prescribing vs Describing*

To run on the GPU, the OpenMP loops must be rewritten with target offload and explicit scheduling clauses:

    while ( error > tol && iter < iter_max ) {
        #pragma omp target teams distribute parallel for \
                    reduction(max:error) collapse(2) schedule(static,1)
        for ( int j = 1; j < n-1; j++ ) {
            for ( int i = 1; i < m-1; i++ ) {
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i] );
                error = fmax( error, fabs(Anew[j][i] - A[j][i]) );
            }
        }
        #pragma omp target teams distribute parallel for \
                    collapse(2) schedule(static,1)
        for ( int j = 1; j < n-1; j++ ) {
            for ( int i = 1; i < m-1; i++ )
                A[j][i] = Anew[j][i];
        }
        if ( iter++ % 100 == 0 ) printf("%5d, %0.6f\n", iter, error);
    }

The OpenACC version runs on the GPU unchanged: the same #pragma acc parallel loop / #pragma acc loop code shown on the previous slide.

* Performance Portability Through Descriptive Parallelism, Jeff Larkin, NVIDIA, DOE Centers of Excellence Performance Portability Meeting, April 19-21, 2016
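For orientation (my addition, not from the talk), representative build lines for these variants might look like the following, where laplace2d.c is a placeholder file name and exact flags vary by compiler version:

    % pgcc -fast -acc -ta=tesla -Minfo=accel laplace2d.c                   # OpenACC, GPU
    % pgcc -fast -acc -ta=multicore -Minfo=accel laplace2d.c               # OpenACC, multicore CPU
    % clang -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda laplace2d.c  # OpenMP target offload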

Performance Impact of Explicit OpenMP Scheduling*

[Bar chart: time in seconds (0 to ~180 s) for icc Serial, icc OpenMP, icc OpenMP (interleaved), clang OpenMP, clang OpenMP (interleaved), clang GPU OpenMP, clang GPU OpenMP (interleaved), pgcc OpenACC, and pgcc GPU OpenACC; per-bar speedups over serial of 1.6x, 1.8x, 2.4x, 8.5x, 8.8x, 16.5x, 29.8x, and 39.5x are annotated.]

System: NVIDIA Tesla K40, Intel Xeon E5-2690 v2 @ 3.00 GHz; Intel C Compiler 15.0, CLANG pre-3.8, PGI 16.3.

* Performance Portability Through Descriptive Parallelism, Jeff Larkin, NVIDIA, DOE Centers of Excellence Performance Portability Meeting, April 19-21, 2016

Using High-bandwidth Memory

Using High-bandwidth Memory

[Diagram: node memory hierarchies with high-capacity memory alongside KNL HBW memory and GPU HBM.]

Portable OpenMP (no control over placement):

    while ( error > tol && iter < iter_max ) {
        #pragma omp parallel for
        ...
        #pragma omp simd
        ...
    }

KNL, explicit placement in HBW memory with hbw_malloc:

    double *A    = hbw_malloc(sizeof(double)*n*m);
    double *Anew = hbw_malloc(sizeof(double)*n*m);
    while ( error > tol && iter < iter_max ) {
        #pragma omp parallel for
        ...
        #pragma omp simd
        ...
    }

GPU, OpenACC data region:

    #pragma acc data copy(A) create(Anew)
    while ( error > tol && iter < iter_max ) {
        #pragma acc parallel loop
        ...
        #pragma acc loop
        ...
    }

Using High-bandwidth Memory (continued): the same three variants, with NV Unified Memory added to the GPU memory hierarchy between high-capacity memory and HBM.

Using High-bandwidth Memory (continued): the slide asks "OpenACC?" of both the KNL HBW memory and the GPU hierarchy, i.e. whether the OpenACC data-region variant (#pragma acc data copy(A) create(Anew) around the iteration loop) could manage placement in each case.

Using High-bandwidth Memory (continued): the same OpenACC data region and parallel loops are shown against all three memory configurations, suggesting that one set of data directives can describe placement in KNL HBW memory, in NV Unified Memory, and in GPU HBM alike.
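To make the OpenACC variant concrete, here is a minimal self-contained sketch of the Jacobi pattern these snippets abbreviate; it is my own reconstruction (grid size, initialization, and iteration limit are arbitrary), not the authors' code:

    #include <stdio.h>
    #include <math.h>

    #define N 1024
    #define M 1024

    int main(void) {
        static double A[N][M], Anew[N][M];
        double tol = 1.0e-6, error = 1.0;
        int iter = 0, iter_max = 1000;

        /* Initialize: interior zero, boundary row j == 0 held at 1.0 */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < M; i++)
                A[j][i] = (j == 0) ? 1.0 : 0.0;

        /* Data region: A is copied to the device once and back once,
           Anew is allocated on the device only. */
        #pragma acc data copy(A) create(Anew)
        while (error > tol && iter < iter_max) {
            error = 0.0;

            #pragma acc parallel loop reduction(max:error)
            for (int j = 1; j < N-1; j++) {
                #pragma acc loop reduction(max:error)
                for (int i = 1; i < M-1; i++) {
                    Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1]
                                       + A[j-1][i] + A[j+1][i]);
                    error = fmax(error, fabs(Anew[j][i] - A[j][i]));
                }
            }

            #pragma acc parallel loop
            for (int j = 1; j < N-1; j++) {
                #pragma acc loop
                for (int i = 1; i < M-1; i++)
                    A[j][i] = Anew[j][i];
            }

            if (iter++ % 100 == 0) printf("%5d, %0.6f\n", iter, error);
        }

        return 0;
    }

The data construct applies to the entire while statement, so the arrays stay resident in device (or HBW) memory across all iterations rather than being transferred around every parallel loop.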

OpenACC and CUDA Unified Memory on Kepler

[Bar chart: performance of OpenACC + CUDA Unified Memory relative to OpenACC with directive-based data movement (= 100%), measured on a Haswell + Kepler system; y-axis from 0% to 180%.]

Kepler: only dynamically allocated memory can use CUDA Unified Memory; much explicit data movement is eliminated, which simplifies initial porting.
Pascal: global, stack, and static data can also be placed in Unified Memory; OpenACC user-directed data movement becomes an optimization.
Pascal also brings demand paging, memory oversubscription, and reduced synchronization, broadening the GPU application space and improving performance.
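As a usage note (my addition, not from the slides): with the PGI compilers of this era, CUDA Unified Memory for OpenACC is enabled at compile time rather than in the source, along the lines of

    % pgcc -fast -acc -ta=tesla:managed -Minfo=accel jacobi.c

where jacobi.c is a placeholder name. With -ta=tesla:managed, dynamically allocated data is placed in CUDA Unified Memory and data clauses on those allocations effectively become no-ops, which is what turns user-directed data movement into an optimization rather than a prerequisite for correctness.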

Directives and scalability... or not.

Synchronization in OpenMP and OpenACC

Synchronization was intentionally minimized in the design of OpenACC: reductions and atomics only. OpenMP, by contrast, is rich with synchronization features, most of them designed for modestly parallel systems. NOTE: Fortran 2015 DO CONCURRENT and the C++17 par_vec constructs likewise offer little or no synchronization.
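As a concrete illustration (a toy example of mine, not from the slides), the reduction clause and the atomic directive are essentially the only synchronization OpenACC exposes:

    #include <stdio.h>

    int main(void) {
        int n = 1000000;
        double sum = 0.0;
        int bins[10] = {0};

        /* Reduction: the cross-lane combining operation OpenACC provides */
        #pragma acc parallel loop reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += 1.0 / (double)(i + 1);

        /* Atomic update: the only fine-grained synchronization available */
        #pragma acc parallel loop
        for (int i = 0; i < n; i++) {
            #pragma acc atomic update
            bins[i % 10]++;
        }

        printf("sum = %f  bins[0] = %d\n", sum, bins[0]);
        return 0;
    }

There are no barriers, critical sections, or locks to express; code that relies on them has to be restructured for highly parallel targets.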

Non-scalable OpenMP Features: simd(safelen), MASTER, SECTIONS, BARRIER, SINGLE, ORDERED, CRITICAL, TASK, TASKLOOP, cancellation, locks, DEPEND.
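To illustrate why such features limit scalability, here is a deliberately bad toy example (mine, not from the talk): every iteration funnels through one critical section, so adding lanes adds contention rather than throughput, whereas a reduction clause expresses the same computation without any locking:

    #include <stdio.h>

    int main(void) {
        long n = 1 << 20;
        long count = 0;

        /* Non-scalable: all threads serialize on a single lock. */
        #pragma omp parallel for
        for (long i = 0; i < n; i++) {
            #pragma omp critical
            count += i & 1;
        }

        /* Scalable alternative: each lane accumulates privately,
           partial sums are combined at the end. */
        long count2 = 0;
        #pragma omp parallel for reduction(+:count2)
        for (long i = 0; i < n; i++)
            count2 += i & 1;

        printf("count = %ld, count2 = %ld\n", count, count2);
        return 0;
    }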

Concluding Thoughts
- Descriptive directives maximize productivity and performance portability.
- We can (and should) use directives to manage high-bandwidth memory.
- Legacy OpenMP synchronization features will inhibit on-node scalability; proceed with caution.