Best GPU Code Practices Combining OpenACC, CUDA, and OmpSs

1 Best GPU Code Practices Combining OpenACC, CUDA, and OmpSs Pau Farré Antonio J. Peña Munich, Oct

2 PROLOGUE

3 Barcelona Supercomputing Center: Marenostrum 4 (PetaFlop/s)
- General Purpose Computing: 3400 nodes of Xeon, 11 PF/s
- Emerging Technologies:
  - Power 9 + Pascal: 1.5 PF/s
  - Knights Landing and Knights Hill: 0.5 PF/s
  - 64-bit ARMv8: 0.5 PF/s

4 Mission of BSC Scientific Departments
- COMPUTER SCIENCES: To influence the way machines are built, programmed and used: programming models, performance tools, Big Data, computer architecture, energy efficiency
- EARTH SCIENCES: To develop and implement global and regional state-of-the-art models for short-term air quality forecast and long-term climate applications
- LIFE SCIENCES: To understand living organisms by means of theoretical and computational methods (molecular modeling, genomics, proteomics)
- CASE: To develop scientific and engineering software to efficiently exploit supercomputing capabilities (biomedical, geophysics, atmospheric, energy, social and economic simulations)

5 BSC Training on European Level - PATC
PRACE Advanced Training Centers
PRACE has designated 6 Advanced Training Centers:
- Barcelona Supercomputing Center (Spain)
- CINECA Consorzio Interuniversitario (Italy)
- CSC - IT Center for Science Ltd (Finland)
- EPCC at the University of Edinburgh (UK)
- Gauss Centre for Supercomputing (Germany)
- Maison de la Simulation (France)
Mission of PATCs: carry out and coordinate training and education activities that foster the efficient usage of the infrastructure available through PRACE.

6 BSC & The Global IT Industry 2016
- IBM-BSC Deep Learning Center
- NVIDIA GPU Center of Excellence
- BSC-Microsoft Research Centre
- Intel-BSC Exascale Lab

7 Projects with the Energy Industry
- Repsol-BSC Research Center: research into advanced technologies for the exploration of hydrocarbons, subterranean and subsea reserve modelling and fluid flows
- Iberdrola Renovables

8 BSC/UPC NVIDIA GPU Center of Excellence
- NVIDIA Award to BSC/UPC (since 2011)
- R&D around GPU Computing (currently ~10 core collaborators)
  - Architecture, Programming Models, Libraries, Applications, Porting
- Education, Training, Dissemination (free registration)
  - PUMPS Summer School: advanced CUDA mainly
  - PRACE Adv. Training Center courses on Introduction to CUDA & OpenACC
  - Severo Ochoa Seminars on Deep Learning & Image/Video Processing
- Always open to research collaborations, internships, advising, hiring

9 Introductions
- Pau Farré, Jr. Engineer
  - GCoE Core Team
  - GPU porting and optimization specialist
  - Did most of the hard work for this lab
- Antonio J. Peña, Sr. Researcher
  - Manager of the GCoE
  - Juan de la Cierva Fellow
  - Prospective Marie Curie Fellow
  - Activity Leader, Accelerators and Communications for HPC
  - The one to blame if anything goes wrong

10 Introduction: Programming Models for GPU Computing
- CUDA (Compute Unified Device Architecture)
  - Runtime & Driver APIs (high-level / low-level)
  - Specific for NVIDIA GPUs: best performance & control
- OpenACC (Open Accelerators)
  - Open Standard
  - Higher-level, pragma-based
  - Aiming at portability across heterogeneous hardware
  - For NVIDIA GPUs, implemented on top of CUDA
- OpenCL (Open Computing Language)
  - Open Standard
  - Low-level, similar to CUDA Driver API
  - Multi-target, portable*
(Intentionally leaving out weird stuff like Cg, OpenGL, ...)

11 Motivation: Coding Productivity & Performance
[Color-coded table rating Coding Prod. / Perf. for: CUDA; OpenACC; OpenACC + CUDA; OmpSs + CUDA; OmpSs + OpenACC]
- CUDA: Don't get me wrong: CUDA delivers awesome coding productivity w.r.t., e.g., OpenGL, but I only want to use 3 (easy) colors here. Please interpret colors as relative to each other.
- OpenACC: may well deliver more than the performance you *need*. However, it gives us the lowest control over performance w.r.t. the discussed alternatives.
- OmpSs: high-level, task-based, pragma-based, developed at BSC. Targets accelerators combined with CUDA or (recently) OpenACC.

12 HANDS-ON

13 LAB CONNECTION INSTRUCTIONS - Part 1
- Go to nvlabs.qwiklab.com
- Sign in or create an account
- Check for Access Codes (each day):
  - Click My Account
  - Click Credits & Subscriptions
- If no Access Codes, ask for a paper one from a TA. Please tear it in half once used
- An Access Code is needed to start the lab
WIFI SSID: GTC_Hands_On Password: HandsOnGpu

14 LAB CONNECTION INSTRUCTIONS - Part 2
1. Click Qwiklabs in upper-left
2. Select GTC2017 Class
3. Find lab and click on it
4. Click on Select
5. Click Start Lab
WIFI SSID: GTC_Hands_On Password: HandsOnGpu

15 Steps to Parallelize with OpenACC
1. Identify Parallelism: using a CPU profiling tool (example: nvprof --cpu-profiling on)
2. Express Parallelism: declare parallel regions with directives
3. Express Data Locality: help OpenACC figure out how to manage data
4. Optimize: using nvprof & NVIDIA Visual Profiler
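
Steps 2 and 3 can be sketched on a single loop. This is a minimal sketch, not code from the lab: the function name `axpy_acc` and its parameters are assumptions made for illustration. A compiler without OpenACC support ignores the directive and runs the loop sequentially with the same result.

```c
/* Hypothetical example for steps 2 and 3; not part of the lab code. */
void axpy_acc(int n, float a, const float *x, float *y)
{
    /* Step 2: "parallel loop" declares the loop as a parallel region.
     * Step 3: copyin/copy state which arrays move to and from the device. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Calling `axpy_acc(4, 2.0f, x, y)` on two 4-element arrays updates `y` in place. Steps 1 and 4 are tool-driven and are not represented in the code.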

16 FWI: A Full Wave Inversion Oil & Gas (mini-)application
- Analyzes physical properties of the subsoil from seismic measures
- Elastic wave propagator + linearly elastic stress-strain relationships
  - Six different stress components
- Finite differences (FD) method with a Fully Staggered Grid (FSG)
- Base code developed by the BSC Repsol Team
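
To give a feel for the kind of loop a finite-difference code spends its time in, here is a minimal sketch of a first derivative on a plain 1D staggered grid. It is much simpler than the Fully Staggered Grid used by FWI and is not taken from the FWI code base: the function `staggered_diff`, its arguments, and the OpenACC clauses are all assumptions made for illustration.

```c
/* Hypothetical 1D example; not taken from the FWI code base.
 * f holds n samples at integer grid points; d receives n - 1 derivative
 * values located at the half points i + 1/2 (the "staggered" positions). */
void staggered_diff(int n, float dx, const float *f, float *d)
{
    /* Every output element is independent, so the loop is parallel. */
    #pragma acc parallel loop copyin(f[0:n]) copyout(d[0:n-1])
    for (int i = 0; i < n - 1; ++i) {
        d[i] = (f[i + 1] - f[i]) / dx;
    }
}
```

Each output depends only on two neighbouring inputs, which is the property that makes stencil loops of this kind good candidates for the directives discussed in this deck.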

17 FWI Parallelization OpenACC/CUDA #6: Results
Our optimized CUDA kernels have better performance than the OpenACC ones.
[Figure: FWI speedups (baseline: OpenMP) on Xeon Platinum 8160 (23c), Tesla K40 (Kepler), Titan X (Maxwell) and Tesla P100 (Pascal); y-axis: speedup.]

18 OmpSs + CUDA / OpenACC

19 OmpSs
- Main Program
  - Sequential control flow
  - Defines a single address space
  - Executes sequential code that
    - Can spawn/instantiate tasks that will be executed sometime in the future
    - Can stall/wait for tasks
- Tasks annotated with directionality clauses
  - in, out, inout
  - Used
    - To build dependences among tasks
    - For main to wait for data to be produced
  - Basis for memory management functionalities (replication, locality, movement, ...)
    - Copy clauses
- Sequential equivalence (~)
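
The directionality clauses and the sequential equivalence can be illustrated with a minimal sketch. The functions `fill`, `add_into` and `run_tasks` are hypothetical names chosen for this example; the `[n]a` array-section syntax follows the AXPY slide later in this deck. Compiled as plain C (without OpenMP or OmpSs flags) the pragmas are ignored and the code runs sequentially, producing the result a task-parallel run is expected to match.

```c
/* Hypothetical task bodies; ordinary sequential functions. */
void fill(int n, float *v, float value)
{
    for (int i = 0; i < n; ++i) v[i] = value;
}

void add_into(int n, const float *src, float *dst)
{
    for (int i = 0; i < n; ++i) dst[i] += src[i];
}

float run_tasks(int n, float *a, float *b)
{
    /* The two producers write different arrays: no dependence between them. */
    #pragma omp task out([n]a)
    fill(n, a, 1.0f);

    #pragma omp task out([n]b)
    fill(n, b, 2.0f);

    /* Reads a and updates b, so it depends on both producers. */
    #pragma omp task in([n]a) inout([n]b)
    add_into(n, a, b);

    /* Main stalls here until all tasks have finished. */
    #pragma omp taskwait
    return b[n - 1];
}
```

The main program never states which task runs where or when; the clauses alone give the runtime enough information to order the third task after the first two.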

20 OmpSs: A Sequential Program
(Figure: NT x NT blocked matrix with TS x TS tiles.)

void Cholesky( float *A[NT][NT] ) {
  int i, j, k;
  for (k=0; k<NT; k++) {
    spotrf (A[k][k]);
    for (i=k+1; i<NT; i++) {
      strsm (A[k][k], A[k][i]);
    }
    for (i=k+1; i<NT; i++) {
      for (j=k+1; j<i; j++) {
        sgemm( A[k][i], A[k][j], A[j][i]);
      }
      ssyrk (A[k][i], A[i][i]);
    }
  }
}

21 OmpSs: with Directionality Annotations
(Figure: NT x NT blocked matrix with TS x TS tiles.)

void Cholesky( float *A[NT][NT] ) {
  int i, j, k;
  for (k=0; k<NT; k++) {
    #pragma omp task inout (A[k][k])
    spotrf (A[k][k]);
    for (i=k+1; i<NT; i++) {
      #pragma omp task in (A[k][k]) inout (A[k][i])
      strsm (A[k][k], A[k][i]);
    }
    for (i=k+1; i<NT; i++) {
      for (j=k+1; j<i; j++) {
        #pragma omp task in (A[k][i], A[k][j]) inout (A[j][i])
        sgemm( A[k][i], A[k][j], A[j][i]);
      }
      #pragma omp task in (A[k][i]) inout (A[i][i])
      ssyrk (A[k][i], A[i][i]);
    }
  }
}

22 OmpSs: that Happens to Execute in Parallel
(Figure: NT x NT blocked matrix with TS x TS tiles.)

void Cholesky( float *A[NT][NT] ) {
  int i, j, k;
  for (k=0; k<NT; k++) {
    #pragma omp task inout (A[k][k])
    spotrf (A[k][k]);
    for (i=k+1; i<NT; i++) {
      #pragma omp task in (A[k][k]) inout (A[k][i])
      strsm (A[k][k], A[k][i]);
    }
    for (i=k+1; i<NT; i++) {
      for (j=k+1; j<i; j++) {
        #pragma omp task in (A[k][i], A[k][j]) inout (A[j][i])
        sgemm( A[k][i], A[k][j], A[j][i]);
      }
      #pragma omp task in (A[k][i]) inout (A[i][i])
      ssyrk (A[k][i], A[i][i]);
    }
  }
}

Decouple how we write/think (sequential) from how it is executed.

23 OmpSs + CUDA Example: AXPY Algorithm
1. Port kernel to CUDA (kernel.cu)
2. Annotate device (cuda) (kernel.cuh)
3. Complete device (smp) (kernel.h, kernel.c)

main.c:

#include <stdio.h>
#include <kernel.h>
// N is assumed to be a compile-time constant defined elsewhere (e.g., in kernel.h)
int main(int argc, char *argv[]) {
  float a=5, x[N], y[N];
  // Initialize values
  for (int i=0; i<N; ++i)
    x[i] = y[i] = i;
  // Compute saxpy algorithm (1 task)
  saxpy(N, a, x, y);
  #pragma omp taskwait
  // Check results
  for (int i=0; i<N; ++i) {
    if (y[i]!=a*i+i) {
      fprintf(stderr, "error\n");
      return 1;
    }
  }
  printf("results are correct\n");
  return 0;
}

kernel.h (device smp):

#pragma omp target device(smp) copy_deps
#pragma omp task in([n]x) inout([n]y)
void saxpy(int n, float a, float* x, float* y);

kernel.c:

void saxpy(int n, float a, float *X, float *Y) {
  for (int i=0; i<n; ++i)
    Y[i] = X[i] * a + Y[i];
}

kernel.cuh (device cuda):

#pragma omp target device(cuda) copy_deps ndrange(1,n,128)
#pragma omp task in([n]x) inout([n]y)
__global__ void saxpy(int n, float a, float* x, float* y);

kernel.cu:

__global__ void saxpy(int n, float a, float* x, float* y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

24 OmpSs + OpenACC: General Idea
- Taskify your whole application in a data-flow manner
- Process kernels are just a type of task executed inside a GPU
- The OmpSs runtime automatically manages the use of streams & memory transfers
- OpenACC directives are used to generate all GPU kernels, which OmpSs treats as CUDA tasks
- Greatest coding productivity for accelerators!
- But OpenACC kernels might perform worse than fine-tuned CUDA

25 OmpSs + OpenACC: Syntax (not released yet)

#pragma omp target(openacc)
#pragma omp task in(rho, sxptr, syptr, szptr) inout(vptr)
#pragma acc parallel loop deviceptr(rho, sxptr, syptr, szptr, vptr)
for (int y=ny0; y < nyf; y++) {
  for (int x=nx0; x < nxf; x++) {
    for (int z=nz0; z < nzf; z++) {
      // code
    }
  }
}
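
A complete function built around that pragma layout might look like the following. This is only a sketch under stated assumptions: the function `update_velocity`, the flat array indexing, and the loop body are hypothetical, and the directives simply copy the unreleased syntax shown on the slide. Compiled as plain C the pragmas are ignored; under the integration described here, it is the OmpSs runtime that would be expected to supply the device pointers named in `deviceptr`.

```c
/* Hypothetical kernel; only the three-pragma layout comes from the slide. */
void update_velocity(int nx, int ny, int nz, float dt,
                     const float *rho, const float *sxptr, float *vptr)
{
    #pragma omp target(openacc)
    #pragma omp task in(rho, sxptr) inout(vptr)
    #pragma acc parallel loop deviceptr(rho, sxptr, vptr)
    for (int y = 0; y < ny; y++) {
        for (int x = 0; x < nx; x++) {
            for (int z = 0; z < nz; z++) {
                /* Flat index with z as the fastest-varying dimension. */
                int idx = (y * nx + x) * nz + z;
                vptr[idx] += dt * sxptr[idx] / rho[idx];
            }
        }
    }
}
```

The task clauses describe the data flow to OmpSs, while the `acc` directive asks the OpenACC compiler to generate the GPU kernel, so no hand-written CUDA is involved.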

26 FWI Parallelization OmpSs/OpenACC - Results
OmpSs/OpenACC performance is similar to OpenACC.
[Figure: FWI speedups (baseline: OpenMP) on Xeon Platinum 8160 (23c), Tesla K40 (Kepler), Titan X (Maxwell) and Tesla P100 (Pascal); y-axis: speedup.]

27 Your Turn! Open Follow step-by-step GTC2017eu.md

28 Thank you! For further information please contact
