MultiGPU Made Easy by OmpSs + CUDA/OpenACC
1 MultiGPU Made Easy by OmpSs + CUDA/OpenACC. Antonio J. Peña, Sr. Researcher & Activity Lead Manager, BSC/UPC NVIDIA GCoE. San Jose, 2018
2 Introduction: Programming Models for GPU Computing

CUDA (Compute Unified Device Architecture): Runtime & Driver APIs (high-level / low-level). Specific to NVIDIA GPUs: best performance & control.
OpenACC (Open Accelerators): open standard. Higher-level, pragma-based, aiming at portability across heterogeneous hardware. For NVIDIA GPUs, implemented on top of CUDA.
OpenCL (Open Computing Language): open standard. Low-level, similar to the CUDA Driver API. Multi-target, portable*.
(Intentionally leaving out weird stuff like Cg, OpenGL, ...)
3 Motivation: Coding Productivity & Performance

(Chart: coding productivity vs. performance for CUDA, OpenACC, OpenACC + CUDA, OmpSs + CUDA, OmpSs + OpenACC, and OmpSs + OpenACC + CUDA.)
Don't get me wrong: CUDA delivers awesome coding productivity w.r.t., e.g., OpenGL, but I only want to use 3 (easy) colors here. Please interpret the colors as relative to each other.
OpenACC may well deliver more than the performance you *need*. However, it gives the lowest control over performance w.r.t. the discussed alternatives.
4 EPEEC, an EU H2020 Project

European joint Effort toward a Highly Productive Programming Environment for Heterogeneous Exascale Computing.
FETHPC, 3 years, ~4M, starting October 2018. Subtopic: high productivity programming environments for exascale.
10 partners; coordinator: BSC (I'm the Technical Manager).
High-level objectives: develop & deploy a production-ready parallel programming environment; advance and integrate existing state-of-the-art European technology; high coding productivity, high performance, energy awareness.
5 Proposed Methodology for Application Developers

(Flowchart)
Start -> Automatic Code Annotation -> Satisfactory Code Patterns? If no: update code patterns and re-annotate.
Then: Satisfactory Performance? If yes: Deploy.
If no: Profile -> Directive Optimisation Possible? If yes: tune/insert directives manually; if no: code low-level accelerator kernels. Re-check performance until satisfactory.
6 OmpSs + CUDA / OpenACC
7 OmpSs Main Program

Sequential control flow: defines a single address space and executes sequential code that can spawn/instantiate tasks (to be executed sometime in the future) and can stall/wait for tasks.
Tasks are annotated with directionality clauses: in, out, inout. These are used to build dependences among tasks, to let main wait for data to be produced, and as the basis for memory management functionalities (replication, locality, movement, ...): copy clauses.
8 OmpSs: Sequential Program

    void Cholesky( float *A[NT][NT] )
    {
       int i, j, k;
       for (k=0; k<NT; k++) {
          spotrf (A[k][k]);
          for (i=k+1; i<NT; i++)
             strsm (A[k][k], A[k][i]);
          for (i=k+1; i<NT; i++) {
             for (j=k+1; j<i; j++)
                sgemm (A[k][i], A[k][j], A[j][i]);
             ssyrk (A[k][i], A[i][i]);
          }
       }
    }

(Figure: an NT x NT matrix of TS x TS tiles.)
9 OmpSs: with Directionality Annotations

    void Cholesky( float *A[NT][NT] )
    {
       int i, j, k;
       for (k=0; k<NT; k++) {
          #pragma omp task inout (A[k][k])
          spotrf (A[k][k]);
          for (i=k+1; i<NT; i++) {
             #pragma omp task in (A[k][k]) inout (A[k][i])
             strsm (A[k][k], A[k][i]);
          }
          for (i=k+1; i<NT; i++) {
             for (j=k+1; j<i; j++) {
                #pragma omp task in (A[k][i], A[k][j]) inout (A[j][i])
                sgemm (A[k][i], A[k][j], A[j][i]);
             }
             #pragma omp task in (A[k][i]) inout (A[i][i])
             ssyrk (A[k][i], A[i][i]);
          }
       }
    }

(Figure: an NT x NT matrix of TS x TS tiles.)
10 OmpSs: ...that Happens to Execute in Parallel

(Same annotated code as slide 9.) Decouple how we write/think (sequential) from how it is executed.
11 Memory Consistency (Getting Consistent Copies)

Relaxed-consistency shared-memory model (OpenMP-like):

    #pragma omp target device (cuda)
    void scale_task_cuda(double *b, double *c, double a, int N)
    {
       int j = blockIdx.x * blockDim.x + threadIdx.x;
       if (j < N) b[j] = a * c[j];
    }

    #pragma omp target device (smp)
    void scale_task_host(double *b, double *c, double a, int N)
    {
       for (int j=0; j < N; j++) b[j] = a * c[j];
    }

    void main(int argc, char *argv[])
    {
       scale_task_cuda (B, A, 10.0, 1024); // T1
       scale_task_cuda (A, B, 0.01, 1024); // T2
       scale_task_host (C, A, 2.00, 1024); // T3
       #pragma omp taskwait // can access any of A, B, C
    }

T1 needs a valid copy of array A in the device. It also allocates array B in the device (no copy needed), and invalidates the other copies of B. (Diagram: task dependency graph T1 -> T2, T1 -> T3; memory transfer of A from host to device; no need to copy B.)
12 Memory Consistency (Reusing Data in Place)

Relaxed-consistency shared-memory model (OpenMP-like). (Same code as slide 11.)
T2 can reuse arrays A and B on the device, since they have already been used by the previous task (T1). Additionally, it invalidates the other copies of A.
13 Memory Consistency (On-Demand Copy Data Back)

Relaxed-consistency shared-memory model (OpenMP-like). (Same code as slide 11.)
T3, a host task, needs array A copied back to the host. It does not invalidate the existing copy in the device.
14 Memory Consistency (Centralized Memory Consistency)

Relaxed-consistency shared-memory model (OpenMP-like). (Same code as slide 11.)
The taskwait requires full memory consistency in the host: modified data is transferred back. (Diagram: timeline T1, T2, T3, TW.)
15 Memory Consistency (Avoid taskwait Consistency)

Relaxed-consistency shared-memory model (OpenMP-like). Same scale_task_cuda / scale_task_host definitions as slide 11, with an extended main:

    void main(int argc, char *argv[])
    {
       scale_task_cuda (B, A, 10.0, 1024); // T1
       scale_task_cuda (A, B, 0.01, 1024); // T2
       scale_task_host (C, A, 2.00, 1024); // T3
       #pragma omp taskwait noflush // does not flush data dev -> host
       scale_task_cuda (B, C, 3.00, 1024); // T4
       #pragma omp taskwait // can access any of A, B, C
    }

The noflush taskwait waits for task finalization, but neither copies memory back to the host nor invalidates it.
16 Memory Consistency (Avoid taskwait Consistency)

Relaxed-consistency shared-memory model (OpenMP-like). (Same code as slide 15.)
Before executing T4, the runtime needs a consistent copy of C in the device, and it invalidates all previous versions of B (the array T4 writes).
17 Memory Consistency (Avoid taskwait Consistency)

Relaxed-consistency shared-memory model (OpenMP-like). (Same code as slide 15.)
The final taskwait waits for task finalization; it invalidates all device data versions and forces memory consistency in the host.
18 OmpSs + CUDA Example: SAXPY Algorithm

Steps: (1) port the kernel to CUDA; (2) annotate it with device (cuda); (3) complete the device (smp) version.

main.c:

    #include <kernel.h>
    int main(int argc, char *argv[])
    {
       float a=5, x[N], y[N];
       // Initialize values
       for (int i=0; i<N; ++i)
          x[i] = y[i] = i;
       // Compute saxpy algorithm (1 task)
       saxpy(N, a, x, y);
       #pragma omp taskwait
       // Check results
       for (int i=0; i<N; ++i)
          if (y[i] != a*i+i) perror("Error\n");
       message("Results are correct\n");
    }

kernel.h / kernel.c (smp version):

    #pragma omp target device(smp) copy_deps
    #pragma omp task in([n]x) inout([n]y)
    void saxpy(int n, float a, float* x, float* y);

    void saxpy(int n, float a, float *x, float *y)
    {
       for (int i=0; i<n; ++i)
          y[i] = x[i] * a + y[i];
    }

kernel.cuh / kernel.cu (cuda version):

    #pragma omp target device(cuda) copy_deps ndrange(1,n,128)
    #pragma omp task in([n]x) inout([n]y)
    __global__ void saxpy(int n, float a, float* x, float* y);

    __global__ void saxpy(int n, float a, float* x, float* y)
    {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n)
          y[i] = a * x[i] + y[i];
    }

So easy! (Though writing the CUDA kernel itself is difficult for non-experienced programmers!)
19 OmpSs + OpenACC: Motivation

What if we could use OpenACC directives with OmpSs? OpenACC is way easier than CUDA. Instead of porting & optimizing many CUDA tasks, port every GPU-accelerated task using OpenACC, and use CUDA only where the OpenACC compiler doesn't provide the required efficiency.
20 OmpSs + OpenACC Example: SAXPY Algorithm

Steps: (1) port the kernel to OpenACC; (2) annotate it with device (openacc); (3) complete the device (smp) version. (main.c is the same as in the CUDA example.)

kernel.h / kernel.c (smp version):

    #pragma omp target device(smp) copy_deps
    #pragma omp task in([n]x) inout([n]y)
    void saxpy (int n, float a, float* x, float* y);

    void saxpy(int n, float a, float *x, float *y)
    {
       for (int i=0; i<n; ++i)
          y[i] = x[i] * a + y[i];
    }

kernel.h / kernel.c (openacc version):

    #pragma omp target device(openacc) copy_deps
    #pragma omp task in([n]x) inout([n]y)
    void saxpy (int n, float a, float* x, float* y);

    void saxpy(int n, float a, float *x, float *y)
    {
       #pragma acc kernels
       for (int i=0; i<n; ++i)
          y[i] = x[i] * a + y[i];
    }

So easy!
21 FWI: Full Wave Inversion

Oil & Gas mini-application. It analyzes physical properties of the subsoil from seismic measurements: an elastic wave propagator plus linearly elastic stress-strain relationships, with six different stress components. Finite differences (FD) method with a Fully Staggered Grid (FSG). Base code developed by the BSC Repsol Team.
22 FWI Parallelization OmpSs/OpenACC: Results

(Chart: FWI speedups, baseline OpenMP on an i7-5930K (6c) = 1.00, for Tesla K40 (Kepler), Titan X (Maxwell) and Titan X (Pascal); measured speedups range from about 1x up to about 19x.)
23 Some Announcements

L8116 Best GPU Code Practices Combining OpenACC, CUDA, and OmpSs. Thu. 10:00-12:00
S8328 One More Step Towards the Simulation of the Human Brain on NVIDIA GPUs (HBP). Thu. 4:00-4:25pm
Join the upcoming EPEEC Users Group.
STARS Open Postdoctoral Fellowships.
PUMPS+AI Summer School, Barcelona, July. Featuring Wen-mei Hwu & David Kirk. Advanced CUDA + brand-new AI format!
antonio.pena@bsc.es
24 Acknowledgements

Guray Ozen: first OmpSs+OpenACC prototype.
Accelerators and Communications for HPC Team (my team). Core: Pau Farré, Marc Jordà, Kyunghun Kim, Mohammad Owais. Collaborators: Pedro Valero, Aimar Rodríguez, Jan Ciesko.
OmpSs Team: awesome programming model and runtime. Xavier Martorell, Vicenç Beltran, Xavier Teruel, Sergi Mateo, JM Perez.
BSC Repsol Team: provided the original FWI implementation. Maurizio Hanzich, Samuel Rodríguez.
PUMPS Summer School: Wen-mei Hwu & TAs: Simón García de Gonzalo, Abdul Dakkak, Carl Pearson, Mert Hidayetoglu; David Kirk, Juan Gómez-Luna.
Pau Farré, Jr. Research Engineer
25 Thank you! For further information please contact antonio.pena@bsc.es
More informationGPU CUDA Programming
GPU CUDA Programming 이정근 (Jeong-Gun Lee) 한림대학교컴퓨터공학과, 임베디드 SoC 연구실 www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction 차례 Multicore/Manycore and GPU GPU on Medical Applications
More informationA case study of performance portability with OpenMP 4.5
A case study of performance portability with OpenMP 4.5 Rahul Gayatri, Charlene Yang, Thorsten Kurth, Jack Deslippe NERSC pre-print copy 1 Outline General Plasmon Pole (GPP) application from BerkeleyGW
More informationOpenACC. Part 2. Ned Nedialkov. McMaster University Canada. CS/SE 4F03 March 2016
OpenACC. Part 2 Ned Nedialkov McMaster University Canada CS/SE 4F03 March 2016 Outline parallel construct Gang loop Worker loop Vector loop kernels construct kernels vs. parallel Data directives c 2013
More informationParallel Hybrid Computing F. Bodin, CAPS Entreprise
Parallel Hybrid Computing F. Bodin, CAPS Entreprise Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism Various heterogeneous
More informationLecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1
Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei
More informationGPU COMPUTING AND THE FUTURE OF HPC. Timothy Lanfear, NVIDIA
GPU COMPUTING AND THE FUTURE OF HPC Timothy Lanfear, NVIDIA ~1 W ~3 W ~100 W ~30 W 1 kw 100 kw 20 MW Power-constrained Computers 2 EXASCALE COMPUTING WILL ENABLE TRANSFORMATIONAL SCIENCE RESULTS First-principles
More informationUnified Memory. Notes on GPU Data Transfers. Andreas Herten, Forschungszentrum Jülich, 24 April Member of the Helmholtz Association
Unified Memory Notes on GPU Data Transfers Andreas Herten, Forschungszentrum Jülich, 24 April 2017 Handout Version Overview, Outline Overview Unified Memory enables easy access to GPU development But some
More informationModule 2: Introduction to CUDA C. Objective
ECE 8823A GPU Architectures Module 2: Introduction to CUDA C 1 Objective To understand the major elements of a CUDA program Introduce the basic constructs of the programming model Illustrate the preceding
More informationQuad Doubles on a GPU
Quad Doubles on a GPU 1 Floating-Point Arithmetic floating-point numbers quad double arithmetic quad doubles for use in CUDA programs 2 Quad Double Square Roots quad double arithmetic on a GPU a kernel
More informationCUDA. Matthew Joyner, Jeremy Williams
CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel
More informationDirective-based Programming for Highly-scalable Nodes
Directive-based Programming for Highly-scalable Nodes Doug Miles Michael Wolfe PGI Compilers & Tools NVIDIA Cray User Group Meeting May 2016 Talk Outline Increasingly Parallel Nodes Exposing Parallelism
More informationOpenMP Tutorial. Dirk Schmidl. IT Center, RWTH Aachen University. Member of the HPC Group Christian Terboven
OpenMP Tutorial Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group schmidl@itc.rwth-aachen.de IT Center, RWTH Aachen University Head of the HPC Group terboven@itc.rwth-aachen.de 1 Tasking
More informationGPU Programming with Ateji PX June 8 th Ateji All rights reserved.
GPU Programming with Ateji PX June 8 th 2010 Ateji All rights reserved. Goals Write once, run everywhere, even on a GPU Target heterogeneous architectures from Java GPU accelerators OpenCL standard Get
More informationParallel Programming. OpenMP Parallel programming for multiprocessors for loops
Parallel Programming OpenMP Parallel programming for multiprocessors for loops OpenMP OpenMP An application programming interface (API) for parallel programming on multiprocessors Assumes shared memory
More informationUsing a GPU in InSAR processing to improve performance
Using a GPU in InSAR processing to improve performance Rob Mellors, ALOS PI 152 San Diego State University David Sandwell University of California, San Diego What is a GPU? (Graphic Processor Unit) A graphics
More informationOpenMP and GPU Programming
OpenMP and GPU Programming GPU Intro Emanuele Ruffaldi https://github.com/eruffaldi/course_openmpgpu PERCeptual RObotics Laboratory, TeCIP Scuola Superiore Sant Anna Pisa,Italy e.ruffaldi@sssup.it April
More informationIntroduction to GPU programming. Introduction to GPU programming p. 1/17
Introduction to GPU programming Introduction to GPU programming p. 1/17 Introduction to GPU programming p. 2/17 Overview GPUs & computing Principles of CUDA programming One good reference: David B. Kirk
More informationParallel Programming. Libraries and implementations
Parallel Programming Libraries and implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationEarly Experiences With The OpenMP Accelerator Model
Early Experiences With The OpenMP Accelerator Model Chunhua Liao 1, Yonghong Yan 2, Bronis R. de Supinski 1, Daniel J. Quinlan 1 and Barbara Chapman 2 1 Center for Applied Scientific Computing, Lawrence
More informationOpenACC Course Lecture 1: Introduction to OpenACC September 2015
OpenACC Course Lecture 1: Introduction to OpenACC September 2015 Course Objective: Enable you to accelerate your applications with OpenACC. 2 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15:
More informationLecture 4: OpenMP Open Multi-Processing
CS 4230: Parallel Programming Lecture 4: OpenMP Open Multi-Processing January 23, 2017 01/23/2017 CS4230 1 Outline OpenMP another approach for thread parallel programming Fork-Join execution model OpenMP
More informationProgramming in CUDA. Malik M Khan
Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement
More informationOpenMP Tasking Model Unstructured parallelism
www.bsc.es OpenMP Tasking Model Unstructured parallelism Xavier Teruel and Xavier Martorell What is a task in OpenMP? Tasks are work units whose execution may be deferred or it can be executed immediately!!!
More informationHeterogeneous Multicore Parallel Programming
Innovative software for manycore paradigms Heterogeneous Multicore Parallel Programming S. Chauveau & L. Morin & F. Bodin Introduction Numerous legacy applications can benefit from GPU computing Many programming
More informationOpenMP API Version 5.0
OpenMP API Version 5.0 (or: Pretty Cool & New OpenMP Stuff) Michael Klemm Chief Executive Officer OpenMP Architecture Review Board michael.klemm@openmp.org Architecture Review Board The mission of the
More informationAn innovative compilation tool-chain for embedded multi-core architectures M. Torquati, Computer Science Departmente, Univ.
An innovative compilation tool-chain for embedded multi-core architectures M. Torquati, Computer Science Departmente, Univ. Of Pisa Italy 29/02/2012, Nuremberg, Germany ARTEMIS ARTEMIS Joint Joint Undertaking
More informationSPOC : GPGPU programming through Stream Processing with OCaml
SPOC : GPGPU programming through Stream Processing with OCaml Mathias Bourgoin - Emmanuel Chailloux - Jean-Luc Lamotte January 23rd, 2012 GPGPU Programming Two main frameworks Cuda OpenCL Different Languages
More informationIntroduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research
Introduction to CUDA CME343 / ME339 18 May 2011 James Balfour [ jbalfour@nvidia.com] NVIDIA Research CUDA Programing system for machines with GPUs Programming Language Compilers Runtime Environments Drivers
More information