Hybrid OmpSs OmpSs, MPI and GPUs Guillermo Miranda

Size: px

Start display at page:

Download "Hybrid OmpSs OmpSs, MPI and GPUs Guillermo Miranda"

Fay Powers
5 years ago
Views:

1 Hybrid OmpSs OmpSs, MPI and GPUs Guillermo Miranda Trondheim, 3 October 2013

2 Outline OmpSs Programming Model StarSs Comparison with OpenMP Heterogeneity (OpenCL and CUDA support) Hybrid MPI+OmpSs Why? Advantages Communication overlapping OpenCL Integration in OmpSs Directives Fortran support Performance and scalability Contact: Source code available from 2

3 OmpSs Programming Model StarSs StarSs: a family of task based programming models. SMPSs CellSs GPUSs GridSs Basic concept: write sequential on a flat single address space + directionality annotations Data dependency synchronisation Asynchronous calls OmpSs combines StarSs features into a single programming model 3

4 OmpSs Programming Model Comparison with OpenMP 3.0 (I) Dataflow is specified using data dependencies No need for parallel regions Explicit synchronization point if required (taskwait) Different task scheduling policies: Global or local task queues (wf, bf) Data locality (NUMA, GPU, cluster) Multiple implementations of the same task (versioning) 4

5 OmpSs Programming Model Comparison with OpenMP 3.0 (II) CUDA and OpenCL support Copy clauses allow to work with multiple address spaces The programmer has the perception of a single address space Runtime library will keep consistency in between memories The compiler tool-chain allows to work with heterogeneus code Multiple device architectures Different implementations of the same task (in the same or different devices) 5

6 OmpSs Programming Model Comparison with OpenMP 4.0 Dependencies Similar syntax OpenMP does not support strided memory Target directive The OpenMP target has a different meaning In OpenMP, the device compiler will be called to generate code 6

7 OmpSs Programming Model OmpSs execution model Dataflow programming model that works well with: Large and complex legacy applications Incremental parallelization methodology Programmer provides local information, runtime manages global application New or legacy applications that requires acceleration Nice support for OpenCL/CUDA kernels View of a single memory address space 7

8 OmpSs Programming Model void foo ( int *a, int *b ) { for ( i = 1; i < N; i++ ) { #pragma omp task in(a[i-1]) inout(a[i]) out(b[i]) propagate(&a[i-1],&a[i],&b[i]); #pragma omp task in(b[i-1]) inout(b[i]) correct(&b[i-1],&b[i]); } #pragma omp taskwait } 8

9 OmpSs Programming Model Tool-chain Mercurium: Source-to-source compiler Supports Fortran, C and C++ Nanos++: Common execution runtime (C, C++ and Fortran) Task creation, dependency management, task scheduling... 9

10 Hybrid MPI+OmpSs

11 Hybrid MPI+OmpSs Why? MPI is here to stay A lot of HPC applications already written in MPI MPI scales well to tens/hundreds of thousands of nodes MPI exploits inter-node parallelism while OmpSs exploits node parallelism Overlap of communication and computation through the inclusion of communication in the task graph MPI/OmpSs overcomes the too synchronous structure of MPI/OpenMP, propagating the dataflow asynchronous behavior to the MPI level 11

12 Hybrid MPI+OmpSs Execution time (lower = better) Performance Higher at smaller problem sizes Improved Load balance (less processes) Higher IPC Overlap communication/computation Tolerance to bandwidth and OS noise Performance (higher = better) Execution time (lower = better) 12

$vsize*l; if(i%2 == 0) { mxm ( lda, sizes[indx][0], lda, hsize, a, b, (c+shift) ); MPI_Sendrecv (a, size, MPI_DOUBLE, down, stag, rbuf, size, MPI_DOUBLE, up, rtag, comm, &stats ); } else { mxm (lda,$

13 Hybrid MPI+OmpSs: Matrix multiplication (MPI) for( i = 0; i < processes; i++ ) { stag = i + 1; rtag = stag; indx = (me + nodes - i )%processes; shift = ((offset[indx][0]/bs)*ndim*bs*bs); size = vsize*l; if(i%2 == 0) { mxm ( lda, sizes[indx][0], lda, hsize, a, b, (c+shift) ); MPI_Sendrecv (a, size, MPI_DOUBLE, down, stag, rbuf, size, MPI_DOUBLE, up, rtag, comm, &stats ); } else { mxm (lda, sizes[indx][0], lda, hsize, rbuf, b, (c+shift) ); MPI_Sendrecv (rbuf, size, MPI_DOUBLE, down, stag, a, size, MPI_DOUBLE, up, rtag, comm, &stats ); } } C A B 8 nodes, 1 proc per node rbuf void mxm ( int lda, int m, int l, int n, double *a, double *b, double *c ) { double alpha=1.0, beta=1.0; int i, j; char tr = 't'; dgemm(&tr, &tr, &m, &n, &l, &alpha, a, &lda, b, &m, &beta, c, &m); } 13

Hybrid MPI+OmpSs Typical hybrid parallelization: Parallelize computation phase Serialize communication part How to overlap communication and

14 Hybrid MPI+OmpSs Typical hybrid parallelization: Parallelize computation phase Serialize communication part How to overlap communication and computation? Using double buffering and Asynchronous MPI calls 2 processes per node. Easier with OmpSs... No computation/communication overlap 14

$Hybrid MPI+OmpSs: Matmul orig_rbuf = rbuf = (double (*)[m])malloc(m*n*sizeof(double)); if (nodes >1) { up = me<nodes-1? me+1:0; down = me>0?$ $B[0:m-1])\ inout (C[i:i+n-1][0:n-1]) firstprivate (n,m) cblas_dgemm(cblasrowmajor, CblasNoTrans, CblasNoTrans, n, n, m, 1.0, (double *)a, m, (double *)B, n, 1.$

15 Hybrid MPI+OmpSs: Matmul orig_rbuf = rbuf = (double (*)[m])malloc(m*n*sizeof(double)); if (nodes >1) { up = me<nodes-1? me+1:0; down = me>0? me-1:nodes-1; } else { up = down = MPI_PROC_NULL; } a=a; i = n*me; // first C block (different for each process) size = m*n; for( it = 0; it < nodes; it++ ) { #pragma omp task in (a[0:n-1], B[0:m-1])\ inout (C[i:i+n-1][0:n-1]) firstprivate (n,m) cblas_dgemm(cblasrowmajor, CblasNoTrans, CblasNoTrans, n, n, m, 1.0, (double *)a, m, (double *)B, n, 1.0, (double *)&C[i][0], n); if (it < nodes-1) { #pragma omp task in (a[0:n-1]) out (rbuf[0:n-1]) inout(stats) \ firstprivate (size,m,n,tag,down,up) MPI_Sendrecv( a, size, MPI_DOUBLE, down, tag, rbuf, size, MPI_DOUBLE, up, tag, MPI_COMM_WORLD, &stats ); } i = (i+n)%m; ptmp=a; a=rbuf; rbuf=ptmp; //next C block circular //swap pointers C A B } #pragma omp taskwait free (orig_rbuf); rbuf 15

16 Hybrid MPI+OmpSs Problems taskifying communications: Possibly several concurrent MPI calls Need to use thread-safe MPI Reordering of MPI calls potential source of deadlocks Need to control order of communication tasks Use of sentinels if necessary 16

17 OpenCL

management (different address spaces) Memory allocation Data transfers

18 OpenCL in OmpSs Two main alternatives CUDA/OpenCL (very similar) Accelerator language (CUDA C/OpenCL C) CUDA/OpenCL Host API Explicit data management (different address spaces) Memory allocation Data transfers Kernel management (compilation, execution, synchronization ) Host memory Device memory 18

19 OpenCL in OmpSs Philosophy: C, C++ and Fortran with OpenCL and CUDA supported Based/compatible with OpenMP 3.0 Write sequential programs an run them in parallel Support most of the OpenMP 3.0 annotations Extend OpenMP with function-tasks and parameter annotations Provide dynamic parallelism and automatic dependency management BLAS kernel, OmpSs C version #pragma omp task in([n]x) inout([n]y) void saxpy(int n, float a, float* x, float* y) { for (i=0; i < size; i++) y[i] = a * x[i] + y[i]; } 19

20 OpenCL in OmpSs: OpenCL example cl_mem d_x = clcreatebuffer(ctx, 0, sizeof(float) * n, 0, 0); cl_mem d_y = clcreatebuffer(ctx, 0, sizeof(float) * n, 0, 0); const char *KernelSrc = "\n" \ " kernel void saxpy(int n, float a, \n" \ " global float* x, \n" \ " global float* y) \n" \ "{ \n" \ " int i = get_global_id(0); \n" \ " if(i < n) \n" \ " y[i] = a * x[i] + y[i]; \n" \ "} clenqueuewritebuffer(cmd, d_x, CL_TRUE, 0, sizeof(float) * n, h_x, 0, 0, 0); clenqueuewritebuffer(cmd, d_y, CL_TRUE, 0, sizeof(float) * n, h_y, 0, 0, 0); clsetkernelarg(ko_saxpy, 0, 4, &n); clsetkernelarg(ko_saxpy, 1, 4, &a); size_t sze = sizeof(cl_mem); clsetkernelarg(ko_saxpy, 2, sze, &d_x); clsetkernelarg(ko_saxpy, 3, sze, &d_y); int main(int argc, char** argv) { float a, h_x[1024], h_y[1024]; // Init a, h_x and h_y; cl_uint numplats; clgetplatformids(0, 0, &numplats); cl_platform_id Plat[numPlats]; clgetplatformids(numplats, Plat, 0); clgetdeviceids(plat[i], DEV, 1, &id,0); cl_context ctx = clcreatecontext(0, 1, &id, 0, 0, 0); cl_command_queue cmd = clcreatecommandqueue(ctx, id, 0, 0); cl_program program = clcreateprogramwithsource(ctx, 1, KernelSrc, 0, 0); clbuildprogram(program, 0, 0, 0, 0, 0); cl_kernel ko_saxpy = clcreatekernel( program, "saxpy", 0); saxpy.c size_t global = 1024, local = 128; clenqueuendrangekernel(cmd, ko_saxpy, 1, 0, &global, &local, 0, 0, 0); clfinish(commands); clenqueuereadbuffer( commands, d_y, CL_TRUE, 0, 4 * n, h_y, 0, 0, 0 ); printf("%f, h_y[0]"); clreleasememobject(d_x); clreleasememobject(d_y); clreleaseprogram(program); clreleasekernel(ko_saxpy); clreleasecommandqueue(commands); clreleasecontext(ctx); return 0; saxpy.c } 20

21 OpenCL in OmpSs: OpenCL example Se rio us ly??? cl_mem d_x = clcreatebuffer(ctx, 0, sizeof(float) * n, 0, 0); cl_mem d_y = clcreatebuffer(ctx, 0, sizeof(float) * n, 0, 0); const char *KernelSrc = "\n" \ " kernel void saxpy(int n, float a, \n" \ " global float* x, \n" \ " global float* y) \n" \ "{ \n" \ " int i = get_global_id(0); \n" \ " if(i < n) \n" \ " y[i] = a * x[i] + y[i]; \n" \ "} clenqueuewritebuffer(cmd, d_x, CL_TRUE, 0, sizeof(float) * n, h_x, 0, 0, 0); clenqueuewritebuffer(cmd, d_y, CL_TRUE, 0, sizeof(float) * n, h_y, 0, 0, 0); clsetkernelarg(ko_saxpy, 0, 4, &n); clsetkernelarg(ko_saxpy, 1, 4, &a); size_t sze = sizeof(cl_mem); clsetkernelarg(ko_saxpy, 2, sze, &d_x); clsetkernelarg(ko_saxpy, 3, sze, &d_y); int main(int argc, char** argv) { float a, h_x[1024], h_y[1024]; // Init a, h_x and h_y; cl_uint numplats; clgetplatformids(0, 0, &numplats); cl_platform_id Plat[numPlats]; clgetplatformids(numplats, Plat, 0); clgetdeviceids(plat[i], DEV, 1, &id,0); cl_context ctx = clcreatecontext(0, 1, &id, 0, 0, 0); cl_command_queue cmd = clcreatecommandqueue(ctx, id, 0, 0); cl_program program = clcreateprogramwithsource(ctx, 1, KernelSrc, 0, 0); clbuildprogram(program, 0, 0, 0, 0, 0); cl_kernel ko_saxpy = clcreatekernel( program, "saxpy", 0); saxpy.c size_t global = 1024, local = 128; clenqueuendrangekernel(cmd, ko_saxpy, 1, 0, &global, &local, 0, 0, 0); clfinish(commands); clenqueuereadbuffer( commands, d_y, CL_TRUE, 0, 4 * n, h_y, 0, 0, 0 ); printf("%f, h_y[0]"); clreleasememobject(d_x); clreleasememobject(d_y); clreleaseprogram(program); clreleasekernel(ko_saxpy); clreleasecommandqueue(commands); clreleasecontext(ctx); return 0; saxpy.c } 21

22 OpenCL in OmpSs: example Easy integration with CUDA/OpenCL kernels #pragma omp target device(cuda opencl) ndrange(dim, size, block_size) Identifies the following function as CUDA C/OpenCL C kernel Provides the number of dimensions, total size and block size #pragma omp task in( ) out( ) Specifies input/output as usual Automatic management of memory Device memory allocation Host/device data transfer Automatic setup of OpenCL kernels On-demand kernel compilation Transparent setting of parameters Data-flow synchronization kernel void saxpy(int n, float a, global float* x, global float* y) { int i = get_global_id(0); if(i < n) y[i] = a * x[i] + y[i]; } saxpy.cl #pragma omp target device(opencl) \ ndrange(1, n, 128) copy_deps #pragma omp task in([n]x) inout([n]y) kernel void saxpy(int n, float a, global float* x, global float* y); #define N 1024 int main(int argc, char* argv[]) { float a, x[n], y[n]; saxpy(n, a, x, y); #pragma omp taskwait print_result(y, N); return 0; } saxpy.c 22

23 OpenCL in OmpSs: target directive Specific device information: the target directive Always attached to the task directive Mandatory use of device clause #pragma omp omp target target device device (type) (type) #pragma [copy_deps] [[ copy_in(...) copy_in(...) ]] [[ copy_out(...) copy_out(...) ]] [[ copy_inout copy_inout (...) (...) ]] }} {{ [copy_deps] ndrange( ) ]] [[ file( ) file( ) ]] [[ name( ) name( ) ]] [[ ndrange( ) #pragma omp omp task task [clause[[,] [clause[[,] clause]...] clause]...] #pragma structured-block }} {{ structured-block Device clause device(type): specific device to run the following task Current supported devices: smp, opencl, cuda, cluster, fpga, mpi Default 23

24 OpenCL in OmpSs: target clauses Target Directive (and copy clauses) #pragma omp omp target target device(type) device(type) [[ copy_in( ) copy_in( ) ]] [[ copy_out( ) copy_out( ) ]] [[ copy_inout( ) copy_inout( ) ]] #pragma #pragma omp omp task task [clause[[,] [clause[[,] clause]...] clause]...] #pragma copy_in(var-list): requesting a consistent copy of variables copy_out(var-list): producing next version of variable copy_inout(var-list): combination of in and out Target Directive (and copy_deps clause) #pragma omp omp target target device(type) device(type) [[ copy_deps copy_deps ]] #pragma #pragma omp omp task task [clause[[,] [clause[[,] clause]...] clause]...] [[ in( ) in( ) ]] [[ out( ) out( ) ]] [[ inout( ) inout( ) ]] #pragma copy_deps: data copy information is in the dependence clauses In( ) copy_in( ) out copy_out inout copy_inout( ) 24

25 OpenCL in OmpSs: dimension clauses Target Directive (and dimension clauses) #pragma omp omp target target device(type) device(type) [[ ndrange( ) ndrange( ) ]] #pragma #pragma omp omp task task [clause[[,] [clause[[,] clause]...] clause]...] #pragma ndrange(ndim,g1[, Gndim],L1[,...Lndim]): kernel execution resources ndim 1, 2 or 3 number of dimensions Gi global size for dimension i (where 1 i ndim) Li local size for dimension i (where 1 i ndim) 25

26 OpenCL in OmpSs: file and name clauses Compiling kernels at runtime (e.g. OpenCL) Target Directive (and file clause) #pragma omp omp target target device(type) device(type) [[ file( ) file( ) ]] #pragma #pragma omp omp task task [clause[[,] [clause[[,] clause]...] clause]...] #pragma file(file-name): file containing kernel code Target Directive (and name clause) The source file must be passed to the compiler #pragma omp omp target target device(type) device(type) [[ name( ) name( ) ]] #pragma #pragma omp omp task task [clause[[,] [clause[[,] clause]...] clause]...] #pragma name(kernel-name): kernel name not required for C/C++ Fortran code is not case sensitive (matmul = MatMul = MATMUL) Forcing fortran s user to call an specific kernel 26

27 OpenCL in OmpSs: shared memory clause Specifies the amount of shared memory allocated for the kernel Target Directive (CUDA) #pragma omp omp target target device(type) device(type) [[ shmem( shmem( expr) expr) ]] #pragma #pragma omp omp task task [clause[[,] [clause[[,] clause]...] clause]...] #pragma expr: an expression with the number of bytes to reserve Target Directive (OpenCL) #pragma omp omp target target device(type) device(type) [[ shmem( shmem( expr1[, expr1[, expr-n expr-n...]...] )) ]] #pragma #pragma omp omp task task [clause[[,] [clause[[,] clause]...] clause]...] #pragma Argument is a list of expressions for each parameter 27

28 OpenCL in OmpsS: MPI OmpSs perfectly integrates CUDA and OpenCL with MPI #pragma omp target device(cuda) copy_deps ndrange(1, last-first, 128 ) #pragma omp task in(particle_array[0;particles]) out([part_per_process]particle_array2) Particle_array_calculate_forces_cuda(Particle* particle_array, Particle* particle_array2, int particles, Int part_per_process, float time_interval, int first, int last); Particle_broadcast_arguments(); Particle_array_initialize(particle_array, particles); for (int i = 1; i <= timesteps; i++) { Particle_array_calculate_forces_cuda(particle_array, particle_array2, particles, part_per_process, time_interval, first, last); #pragma omp target device(smp) copy_deps #pragma omp task out(particle_array[0;particles]) in(particle_array2[first:last]) MPI_Allgather(particle_array, part_per_process*8, MPI_FLOAT, particle_array2, part_per_process*8, MPI_FLOAT, MPI_COMM_WORLD); } /* for timestep */ #pragma omp taskwait 28

29 OpenCL in OmpsS: Fortran support Fortran code can call CUDA or OpenCL kernels like C functions kernel void cstructfac(int na, int nr, int nc, float f2, int NA, global float4* a, int NH, global float4* h,int NE, global float4* E, local float4* ashared){ int i = get_global_id(0); if (i<na) {..}; } Global size/s krist.cl Local size/s INTERFACE!$OMP TARGET DEVICE(OPENCL) COPY_DEPS NDRANGE(1, NR, 128) SHMEM( ) FILE(krist.cl)!$OMP TASK IN (A, H) OUT(E) SUBROUTINE CSTRUCTFAC(NA, NR, NC, F2, DIM_NA, A, DIM_NH, H, DIM_NE, E) main.f90 INTEGER :: NA, NR, NC, DIM_NA, DIM_NH, DIM_NE REAL :: F2, A(DIM_NA), H(DIM_NH), E(DIM_NE) Shared memory END SUBROUTINE CSTRUCTFAC Dim:1, 2 or 3 size/s END INTERFACE Input/Outputs [...] DO II = 1, NR, NR_2 IND_H = (DIM2_H * (II - 1)) + 1 IND_E = (DIM2_E * (II - 1)) + 1 CALL CSTRUCTFAC(NA, NR_2, MAXATOMS, F2, DIM_NA, A, & DIM_NH/TASKS, H(IND_H : IND_H + (DIM_NH/TASKS) - 1),& DIM_NE/TASKS, E(IND_E : IND_E + (DIM_NE / TASKS) - 1)) END DO [...] Normal Fortran subroutine call 29

30 OpenCL in OmpsS: performance and scalability Performance/programmability trade-off: Performance competitive with hand-written code. Lower number of code lines and API calls. Matmul and Krist: lower LoC and API calls Matmul: better performance 30

31 OpenCL in OmpsS: performance and scalability Scalability results competitive with hand-written code Transparently uses the number of available devices If there is enough parallelism 31

32 Conclusions

33 Conclusions OmpSs is a programming model that enables Incremental parallelization of sequential code Data-flow execution model (asynchronous) Nicely supports heterogeneous environments Many optimizations under the hood: Advanced scheduling policies Work stealing/load balancing Data prefetching OmpSs is open source Have a look at 33

34 Thank you! For further information please contact 34

Tutorial OmpSs: Overlapping communication and computation

www.bsc.es Tutorial OmpSs: Overlapping communication and computation PATC course Parallel Programming Workshop Rosa M Badia, Xavier Martorell PATC 2013, 18 October 2013 Tutorial OmpSs Agenda 10:00 11:00