Programming with OmpSs@CUDA/OpenCL


1 Programming with OmpSs@CUDA/OpenCL
Xavier Martorell, Rosa M. Badia
Computer Sciences Research Dept., BSC
PATC Parallel Programming Workshop, July 1-2, 2015

2 Agenda
- Motivation
- Leveraging OpenCL and CUDA
- Examples
Contact:
Source code available from

3 Motivation
OpenCL/CUDA coding is complex and error-prone:
- Memory allocation
- Data copies to/from device memory
- Manual work scheduling
- Code and data management from the host
(figure: host array h mirrored by a device copy devh)

4 Motivation
Memory allocation: a double allocation is needed.
- Host memory:
    h = (float *) malloc(sizeof(*h) * dim2_h * nr);
- Device memory:
    r = cudaMalloc((void **)&devh, sizeof(*h) * nr * dim2_h);
Different data sizes due to blocking make the code confusing.

5 Motivation
Data copies to/from device memory (copy_in/copy_out):
    cudaMemcpy(devh, h, sizeof(*h) * nr * dim2_h, cudaMemcpyHostToDevice);
Compared to homogeneous programming, there are more opportunities to overwrite data by mistake.
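All of this boilerplate is what OmpSs removes. As a point of reference, the manual CUDA pattern these slides allude to looks roughly as follows (a minimal sketch; scale_kernel and scale are hypothetical names, not from the slides):

    // Minimal sketch of the manual allocate / copy-in / launch / copy-out / free pattern
    #include <cuda_runtime.h>

    __global__ void scale_kernel(float *v, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= s;
    }

    void scale(float *h, int n, float s) {
        float *devh;
        cudaMalloc((void **)&devh, sizeof(*h) * n);                  // device allocation
        cudaMemcpy(devh, h, sizeof(*h) * n, cudaMemcpyHostToDevice); // copy in
        scale_kernel<<<(n + 127) / 128, 128>>>(devh, s, n);          // manual work scheduling
        cudaMemcpy(h, devh, sizeof(*h) * n, cudaMemcpyDeviceToHost); // copy out
        cudaFree(devh);
    }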

6 Motivation
Complex code/data management from the host.

main.c:

    // Initialize device, context, and buffers...
    memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(cl_float4) * n, srcB, NULL);
    // create the kernel
    kernel = clCreateKernel(program, "dot_product", NULL);
    // set the args values
    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);
    err = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *) &memobjs[1]);
    err = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *) &memobjs[2]);
    // set work-item dimensions
    global_work_size[0] = n;
    local_work_size[0] = 1;
    // execute the kernel
    err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                                 global_work_size, local_work_size, 0, NULL, NULL);
    // read results
    err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                              n * sizeof(cl_float), dst, 0, NULL, NULL);
    ...

kernel.cl:

    __kernel void dot_product(__global const float4 *a,
                              __global const float4 *b,
                              __global float *c)
    {
        int gid = get_global_id(0);
        c[gid] = dot(a[gid], b[gid]);
    }

7 Proposal: OmpSs
- OpenMP expressiveness: tasking
- StarSs expressiveness: data directionality hints (in/out/inout), detection of dependencies at runtime, automatic data movement
- CUDA: leverage existing kernels

8 OmpSs: execution model
Thread-pool model: the OpenMP parallel construct is ignored. All threads are created on startup:
- One SMP thread executes main() ... and tasks
- P-1 SMP workers execute tasks
- One representative thread (SMP to OpenCL/CUDA) per device; experimenting with a single representative for N devices
All threads get work from a task pool. Work is labeled with its possible targets; tasks with several targets can be scheduled to different devices at the same time.

9 OmpSs: memory model
A single global address space: the runtime system takes care of the devices' local memories.
- SMP machines: no need for extra runtime support.
- Distributed/heterogeneous environments: multiple physical address spaces exist, and versions of the same data can reside in them; data consistency is ensured by the runtime system.

10 OmpSs: Task Syntax
Based on tasks with data directionality hints: in/out/inout/concurrent. The data specified is used to compute dependences and to allow concurrent execution of concurrent/commutative tasks (see the sketch after this slide).

    #pragma omp task [in(array_spec...)] [out(...)] [inout(...)] [concurrent(...)] [commutative(...)] \
                     [priority(p)] [label(...)] \
                     [shared(...)] [private(...)] [firstprivate(...)] [default(...)] \
                     [untied] [final] [if(expression)]
       {code block or function}

    #pragma omp taskwait [on(...)]

With taskwait, the master waits for its child tasks, or for specific data to become available.
Defaults: data is shared.
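As an illustration, a minimal sketch of this syntax (my example, not from the slides; init_vec and scale_vec are hypothetical helpers):

    float v[1024];

    #pragma omp task out([1024] v) label(producer)
    init_vec(v, 1024);                 // first task produces v

    #pragma omp task inout([1024] v) priority(1) label(consumer)
    scale_vec(v, 1024, 2.0f);          // runs after the producer because of inout on v

    #pragma omp taskwait               // the master waits for both tasks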

11 OmpSs: Recommended Task Syntax
The same tasking model, with the directionality hints expressed through depend clauses:

    #pragma omp task [depend(in: array_spec...)] [depend(out: ...)] [depend(inout: ...)] \
                     [depend(concurrent: ...)] [depend(commutative: ...)] \
                     [priority(p)] [label(...)] \
                     [shared(...)] [private(...)] [firstprivate(...)] [default(...)] \
                     [untied] [final] [if(expression)]
       {code block or function}

    #pragma omp taskwait [on(...)]

With taskwait, the master waits for its child tasks, or for specific data to become available.
Defaults: data is shared.
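The same two hypothetical tasks from the previous sketch, written in the recommended depend() form:

    #pragma omp task depend(out: v) label(producer)
    init_vec(v, 1024);

    #pragma omp task depend(inout: v) priority(1) label(consumer)
    scale_vec(v, 1024, 2.0f);

    #pragma omp taskwait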

12 OmpSs: Device Syntax
The target directive marks a task implementation for a device; the compiler configures the kernel using the ndrange clause.

    #pragma omp target device({smp|opencl|cuda}) \
            [implements(function_name)] \
            [copy_deps|no_copy_deps] [copy_in(array_spec,...)] [copy_out(...)] [copy_inout(...)] \
            [ndrange(dim, ...)] [shmem(...)] [file(name)] [name(name)]

implements gives support for multiple implementations of a task. The copy clauses ask the runtime to ensure that consistent data is accessible in the address space of the device.

    #pragma omp task [in(...)] [out(...)] [inout(...)] [concurrent(...)] [commutative(...)] \
            [priority(P)] [label(name)] \
            [shared(...)] [private(...)] [firstprivate(...)] [default(...)] \
            [untied] [final] [if(expression)]
       {code block or function}

    #pragma omp taskwait [on(...)] [noflush]

taskwait can wait for specific data, generated by one or several tasks; with noflush it waits for the tasks, but not necessarily for the data transfers.
Defaults: data is shared; consistency is copy_deps.

13 OmpSs: Device Syntax with the Recommended Task Syntax
The same target directive, combined with the depend form of the task directive (see the sketch after this slide):

    #pragma omp target device({smp|opencl|cuda}) \
            [implements(function_name)] \
            [copy_deps|no_copy_deps] [copy_in(array_spec,...)] [copy_out(...)] [copy_inout(...)] \
            [ndrange(dim, ...)] [shmem(...)] [file(name)] [name(name)]

    #pragma omp task [depend(in: array_spec...)] [depend(out: ...)] [depend(inout: ...)] \
            [depend(concurrent: ...)] [depend(commutative: ...)] \
            [priority(P)] [label(name)] \
            [shared(...)] [private(...)] [firstprivate(...)] [default(...)] \
            [untied] [final] [if(expression)]
       {code block or function}

    #pragma omp taskwait [on(...)] [noflush]

taskwait can wait for specific data, generated by one or several tasks; with noflush it waits for the tasks, but not necessarily for the data transfers.
Defaults: data is shared; consistency is copy_deps.
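Putting the two directives together, a minimal sketch of a kernel declared as an OmpSs task (vec_scale and vec_scale.cl are hypothetical names, not from the slides):

    #pragma omp target device(opencl) copy_deps ndrange(1, n, 128) file(vec_scale.cl)
    #pragma omp task depend(inout: [n] v)
    __kernel void vec_scale(__global float *v, float s, int n);

    // The host calls it like a regular function; the runtime moves v to the
    // device, launches the kernel, and tracks the dependence.
    vec_scale(v, 2.0f, n);
    #pragma omp taskwait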

14 Heterogeneity: the copy clauses

    #pragma omp target [clauses]
    #pragma omp task [clauses]

Some devices (opencl, cuda) have their own private physical address space, so data must be maintained consistent between the original address space of the program and the address space of the device.
- copy_deps: by default, all data referenced in in/out/inout clauses is kept consistent.
- The copy_in, copy_out, and copy_inout clauses specify any other data that needs to be maintained consistent.
- no_copy_deps is a shorthand stating that, for each in/out/inout declaration, there is no need to copy the data.
- Tasks on the program host device (smp) also have to specify directionality, to ensure consistency for arguments referenced by tasks on some other device(s).
- The default taskwait semantics ensure consistency of all the data in the original program address space.

15 Heterogeneity: the copy clauses
- copy_in: requests a consistent copy of the data to be available in the device address space when the task executes. It may require a data transfer, performed automatically by the runtime and scheduled at its convenience; the data may already be in the device. The runtime keeps track of all data replication and ensures consistency; address translation may be needed.
- copy_out: specifies that the task produces that data. The runtime uses this information to ensure data consistency; it may require a transfer some time after the task completes, again performed automatically by the runtime at its convenience.
- copy_inout: specifies that a consistent copy of the data is required on input and that a new value is produced whose consistency has to be maintained.
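For example, a sketch of explicit copy clauses combined with no_copy_deps (saxpy and saxpy.cl are hypothetical names, not from the slides). The dependences are still declared on x and y, but since no_copy_deps disables the default copies, the data to keep consistent is listed explicitly:

    #pragma omp target device(opencl) no_copy_deps \
            copy_in([n] x) copy_inout([n] y) \
            ndrange(1, n, 128) file(saxpy.cl)
    #pragma omp task in([n] x) inout([n] y)
    __kernel void saxpy(int n, float a, __global float *x, __global float *y);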

16 Heterogeneity: the OpenCL/CUDA information clauses
- ndrange: provides the configuration for the OpenCL/CUDA kernel.
    ndrange(ndim, {global|grid}_array, {local|block}_array)
    ndrange(ndim, {global|grid}_dim1, ..., {local|block}_dim1, ...)
  1 to 3 dimensions are valid. Values can be provided through 1-, 2-, or 3-element arrays (global, local), or as two lists of 1, 2, or 3 elements matching the number of dimensions. Values can be function arguments or globally accessible variables.
- file: provides the file containing the OpenCL/CUDA source code for the specific kernel: file(filename). A file can hold several kernels; the kernel is selected by the function name.
- name: names the kernel to be invoked: name(opencl-kernel-name). Useful in FORTRAN because of the capitalization of names.
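A sketch of the two ndrange forms for a 2-D kernel (stencil and stencil.cl are hypothetical names, not from the slides):

    // Form 1: per-dimension values inline: global sizes first, then local sizes
    #pragma omp target device(opencl) copy_deps \
            ndrange(2, width, height, 16, 16) file(stencil.cl)
    #pragma omp task inout([width*height] img)
    __kernel void stencil(__global float *img, int width, int height);

    // Form 2: the sizes can instead be given through 2-element arrays:
    //   int gsize[2] = { width, height }, lsize[2] = { 16, 16 };
    //   ... ndrange(2, gsize, lsize) ...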

17 matmul

    #define BLOCK_SIZE 16
    const int BL_SIZE = BLOCK_SIZE;

    #pragma omp target device(opencl) copy_deps \
            ndrange(2, NB, NB, BL_SIZE, BL_SIZE) file(muld.cl)
    #pragma omp task in([NB*NB] A, [NB*NB] B) inout([NB*NB] C)
    __kernel void Muld(__global REAL *A,   // use __global for the
                       __global REAL *B,   // copy_in/copy_out arguments
                       int wa, int wb,
                       __global REAL *C, int NB);

    void matmul(int m, int l, int n, int mDIM, int lDIM, int nDIM,
                REAL **tileA, REAL **tileB, REAL **tileC)
    {
        int i, j, k;
        for (i = 0; i < mDIM; i++)
            for (k = 0; k < lDIM; k++)
                for (j = 0; j < nDIM; j++)
                    Muld(tileA[i*lDIM+k], tileB[k*nDIM+j], NB, NB,
                         tileC[i*nDIM+j], NB);
    }

(figure: the tiled matrices, mDIM x lDIM and lDIM x nDIM tiles of NB x NB elements each)

18 matmul kernel (muld.cl)

    #include "matmul_auxiliar_header.h"  // defines BLOCK_SIZE

    // Device multiplication function: compute C = A * B
    // wa is the width of A, wb is the width of B
    __kernel void Muld(__global REAL *A, __global REAL *B, int wa, int wb,
                       __global REAL *C, int NB)
    {
        // Block index, thread index
        int bx = get_group_id(0);
        int by = get_group_id(1);
        int tx = get_local_id(0);
        int ty = get_local_id(1);

        // Indexes of the first/last sub-matrix of A processed by the block
        int aBegin = wa * BLOCK_SIZE * by;
        int aEnd   = aBegin + wa - 1;

        // Step size used to iterate through the sub-matrices of A
        int aStep = BLOCK_SIZE;
        ...

19 Julia

    int LGWSIZE = 16;

    #pragma omp target device(opencl) copy_deps \
            ndrange(2, jc.window_size[0]/4, jc.window_size[1], LGWSIZE, 1) \
            file(julia.cl)
    #pragma omp task out(framebuffer[0; jc.window_size[0] * jc.window_size[1]])
    __kernel void compute_julia(float mup0, float mup1, float mup2, float mup3,
                                __global uint32_t *framebuffer,
                                struct julia_context jc);

    for (i = 0; i < iterations; i++) {
        getcurmu(currmu, morphtimer);
        morphtimer += 0.05f;
        compute_julia(currmu[0], currmu[1], currmu[2], currmu[3],
                      framebuffer[other_frame(display_frame)], jc);
    }

20 NBody

    #pragma omp target device(opencl) no_copy_deps file(nbody.cl) \
            copy_in(d_particles[0; number_of_particles]) \
            copy_out([size] out) \
            ndrange(1, size, MAX_NUM_THREADS)
    #pragma omp task out([size] out) \
            in(d_particles[0*size; 1*size], d_particles[1*size; 2*size], \
               d_particles[2*size; 3*size], d_particles[3*size; 4*size], \
               d_particles[4*size; 5*size], d_particles[5*size; 6*size], \
               d_particles[6*size; 7*size], d_particles[7*size; 8*size])
    __kernel void calculate_force_func(int size, float time_interval,
                                       int number_of_particles,
                                       __global Particle *d_particles,
                                       __global Particle *out,
                                       int first_local, int last_local);

    for (i = 0; i < number_of_particles; i += bs) {
        calculate_force_func(bs, time_interval, number_of_particles,
                             this_particle_array, &output_array[i], i, i+bs-1);
    }

21 NBody: SMP vs OpenCL kernels

SMP version (the loop over the local particles is explicit):

    #pragma omp target device(smp) copy_deps
    #pragma omp task out([size] output) in([particles] part)
    void calculate_force_smp(int size, float time_interval, int particles,
                             Particle *part, Particle *output,
                             int first_local, int last_local)
    {
        for (size_t id = first_local; id <= last_local; id++) {
            Particle *this_particle = output + id - first_local;
            float force_x = 0.0f, force_y = 0.0f, force_z = 0.0f;
            float total_force_x = 0.0f, total_force_y = 0.0f, total_force_z = 0.0f;
            for (int i = 0; i < particles; i++) {
                if (i != id) {
                    calculate_force(part + id, part + i, &force_x, &force_y, &force_z);
                    total_force_x += force_x;
                    total_force_y += force_y;
                    total_force_z += force_z;
                }
            }
            [...]
        }
    }

OpenCL version (one work-item per particle replaces the outer loop):

    #pragma omp target device(opencl) ndrange(1, size, 128) copy_deps
    #pragma omp task out([size] output) in([particles] part)
    __kernel void calculate_force_ocl(int size, float time_interval, int particles,
                                      __global Particle *part,
                                      __global Particle *output,
                                      int first_local, int last_local)
    {
        size_t id = get_global_id(0) + first_local;
        if (id > last_local) return;
        __global Particle *this_particle = output + id - first_local;
        float force_x = 0.0f, force_y = 0.0f, force_z = 0.0f;
        float total_force_x = 0.0f, total_force_y = 0.0f, total_force_z = 0.0f;
        for (int i = 0; i < particles; i++) {
            if (i != id) {
                calculate_force(part + id, part + i, &force_x, &force_y, &force_z);
                total_force_x += force_x;
                total_force_y += force_y;
                total_force_z += force_z;
            }
        }
        [...]
    }

22 OpenCL/CUDA device specifics
The compiler generates a stub function task that invokes the kernel, using the information in the ndrange and file clauses.
There is a host thread in the runtime representing each device. The task body (OpenCL/CUDA code) is actually executed by that thread on the host; it launches the kernels:
- compiles the kernel, if necessary
- creates buffers
- sets kernel arguments
- invokes the kernel
The runtime does the memory allocation and deallocation on the device, as well as the data transfers for copy variables. Different memory allocation / copy possibilities exist.

23 Memory allocation / copy

Regular memory allocation / copies:

    p = malloc(SIZE);
    ...
    #pragma omp target device(opencl) ndrange(1, SIZE, 16)
    #pragma omp task inout([SIZE] p)
    __kernel void vec_init(__global float *p);

Device memory allocation / using mapped memory:

    p = nanos_opencl_malloc(SIZE);
    ...
    #pragma omp target device(opencl) ndrange(1, SIZE, 16)
    #pragma omp task inout([SIZE] p)
    __kernel void vec_init(__global float *p);

24 OmpSs: the implements clause
Example of alternative implementations:

scale.c:

    #pragma omp target device(smp) copy_deps
    #pragma omp task in([size] c) out([size] b)
    void scale_task(double *b, double *c, double scalar, int size)
    {
        int j;
        for (j = 0; j < size; j++)
            b[j] = scalar * c[j];
    }

    #pragma omp target device(cuda) copy_deps implements(scale_task) ndrange(1, size, 512)
    #pragma omp task in([size] c) out([size] b)
    __global__ void scale_task_cuda(double *b, double *c, double scalar, int size);

scale_cuda.cu (compiled automatically):

    extern "C" {
        __global__ void scale_task_cuda(double *b, double *c, double scalar, int size);
    }

    __global__ void scale_task_cuda(double *b, double *c, double scalar, int size)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        b[j] = scalar * c[j];
    }

25 Heterogeneity: the target directive

    #pragma omp target device(cuda) copy_deps
    #pragma omp task in([size] c) out([size] b)
    void scale_task_cuda(double *b, double *c, double scalar, int size)
    {
        // Executed on the host; the kernel is launched to the device.
        // b and c are pointers to the memory allocated by the runtime on the device.
        dim3 dimBlock, dimGrid;
        dimBlock.x = 128;
        dimBlock.y = dimBlock.z = 1;
        dimGrid.x = size/128 + 1;
        scale_kernel<<<dimGrid, dimBlock>>>(size, 1, b, c, scalar);
    }

    #pragma omp target device(smp) copy_deps
    #pragma omp task in([size] c) out([size] b)
    void scale_task(double *b, double *c, double scalar, int size)
    {
        for (int j = 0; j < size; j++)
            b[j] = scalar * c[j];
    }

    double A[1024], B[1024], C[1024];
    double D[1024], E[1024];

    main() {
        scale_task_cuda(A, B, 10.0, 1024);  // T1: A, B transferred to the device before execution
        scale_task_cuda(B, A, 0.01, 1024);  // T2: no data transfer; will execute after T1
        scale_task(C, A, 2.0, 1024);        // T3: A transferred to the host; can be done in parallel with T2
        scale_task_cuda(D, E, 5.0, 1024);   // T4: D, E transferred to the GPU; can be done at the very beginning
        scale_task_cuda(B, C, 3.0, 1024);   // T5: C transferred to the GPU; can be done when T3 finishes
        #pragma omp taskwait                // copies D, E back to the host; can access any of A, B, C, D, E
    }

26 Heterogeneity: relaxing consistency

    // No copy of a consistent instance of the input to smp;
    // definition of a new value for the output.
    #pragma omp target device(smp) no_copy_deps copy_out([size] b)
    #pragma omp task in([size] c) out([size] b)
    void scale_task_nocpin(double *b, double *c, double scalar, int size)
    {
        for (int j = 0; j < size; j++)
            b[j] = scalar * c[j];
    }

    double A[1024], B[1024], C[1024];
    double D[1024], E[1024];

    main() {
        scale_task_cuda(A, B, 1.0, 1024);    // T1: A, B transferred to the device before execution
        scale_task_cuda(B, A, 0.1, 1024);    // T2: no data transfer; will execute after T1
        scale_task_nocpin(C, A, 2.0, 1024);  // T3: A not transferred to the host; executes after T1
                                             //     but with the old value of A
        scale_task_cuda(D, E, 5.0, 1024);    // T4: D, E transferred to the GPU; can be done at the very beginning
        scale_task_cuda(C, B, 3.0, 1024);    // T5: C, updated by T3, transferred to the GPU when T3 finishes
        #pragma omp taskwait noflush         // no copy of A, B, C, D, E back to the host

        printf("%f, %f\n", D[1], E[512]);    // prints the original, unmodified values of D and E

        scale_task(E, B, 0.5, 1024);         // T6: B and E copied back to the host before execution
        scale_task_nocpin(D, C, 0.5, 1024);  // T7: no copy of C back to the host;
                                             //     new value of D defined by T7
        #pragma omp taskwait noflush         // no copy of D back to the host
    }

27 OmpSs runtime features
- Automatic and configurable overlap of data transfers (in/out/inout) with kernel executions
- Data prefetching: input data is transferred to GPU memory well before the tasks need it

28 OmpSs runtime features
Configurable data cache policy:
- Write-through: after the execution of each task, output data is updated on the host
- Write-back: host data is updated only if needed
- Nocache: the cache is disabled (not recommended; for testing only)
Configurable limit on GPU memory usage.

29 OmpSs runtime features
CUBLAS support and context management:
- No need to call cublasInit / cublasShutdown (NX_GPUCUBLASINIT)
- CUBLAS functions can also be scheduled to specific streams to enable overlapping
- The runtime takes care of the proper allocation of streams/handles

30 OmpSs Perlin Noise

    #pragma omp target device(cuda) \
            ndrange(2, img_width, BS, 128, 1) file(perlin.cu)
    #pragma omp task out(output[0 : BS*rowstride - 1])
    __global__ void cuda_perlin(pixel output[], float time, int j,
                                int rowstride, int img_width, int BS);

(figure: a 1024-row image processed in blocks of BS rows, j = 0, 1, 2, ...)

31 Mercurium compiler
- Transforms the directives into runtime calls
- Compiles natively with the NVIDIA compiler (C/C++/Fortran)
- Embeds the resulting object files with CUDA into the host files
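A typical invocation might look as follows (a sketch; mcc/mfc are the Mercurium front-ends, and the exact flags depend on the installation):

    mcc --ompss -o app main.c              # C sources with OmpSs directives
    mcc --ompss -o app main.c kernel.cu    # CUDA kernels compiled natively and embedded
    mfc --ompss -o app main.f90            # Fortran front-end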

32 Execution options (SMP & generic)

    NX_PES=<P>                           # number of SMP cores to use
    NX_SCHEDULE=<bf|dbf|wf|socket|versioning>   # scheduling policy
                                         # versioning checks performance and decides how to use devices
    NX_ARGS="..."                        # additional arguments to pass to Nanos++
    NX_ARGS="--schedule-priority"        # enabling priority support
    NX_ARGS="--schedule-smart-priority"

33 Execution options (SMP & generic)

    NX_CACHE_POLICY=<wt|wb|nocache>      # cache policy to use
    NX_INSTRUMENTATION=extrae            # enables instrumentation (Paraver traces)
    EXTRAE_CONFIG_FILE=xmlfile           # sets the XML file for Extrae to use
    NX_ARGS=" --disable-cuda --disable-opencl "   # disabling CUDA/OpenCL support

34 Execution options (OpenCL)

    NX_OPENCL_MAX_DEVICES=<N>            # number of devices (GPUs, MICs)
    NX_OPENCL_DEVICE_TYPE=<GPU|CPU>      # type of device to use
    NX_OPENCL_CACHE_POLICY=<wb|wt|nocache>
    NX_DISABLEOPENCL=<yes|no>

35 Execution options (CUDA)

    NX_GPUS=<N>                          # number of devices (GPUs)
    NX_GPUPREFETCH=<yes|no>              # enable prefetch of task data
    NX_GPUOVERLAP=<yes|no>               # enable overlapping of transfers
    NX_GPUOVERLAP_INPUTS=<yes|no>        #   and computation
    NX_GPUOVERLAP_OUTPUTS=<yes|no>
    NX_GPUMAXMEM=<M>                     # amount of memory to preallocate
    NX_GPUCUBLASINIT=yes                 # initialize CUBLAS

Check the output of $ nanox --help
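For instance, a run combining several of these variables could look like this (a sketch; ./app stands for any OmpSs binary):

    # 8 SMP cores, 2 GPUs, data prefetching, write-back cache
    NX_PES=8 NX_GPUS=2 NX_GPUPREFETCH=yes NX_CACHE_POLICY=wb ./app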

36 OmpSs: Syntax in FORTRAN

    !$OMP TARGET DEVICE({SMP|CUDA|OPENCL|MPI}) &
    !$OMP&   [NDRANGE(...)] [IMPLEMENTS(function_name)] &
    !$OMP&   {[COPY_DEPS]|[NO_COPY_DEPS] [COPY_IN(...)] [COPY_OUT(...)] [COPY_INOUT(...)]} &
    !$OMP&   [SHMEM(...)] [FILE(...)] [NAME(...)]
    !$OMP TASK [IN(...)] [OUT(...)] [INOUT(...)] [CONCURRENT(...)] [COMMUTATIVE(...)] &
    !$OMP&   [PRIORITY(...)] [LABEL(...)] &
    !$OMP&   [SHARED(...)] [PRIVATE(...)] [FIRSTPRIVATE(...)] [DEFAULT(...)] &
    !$OMP&   [UNTIED] [IF(EXPRESSION)] [FINAL(...)]
       { code or function }
    !$OMP TASKWAIT [ON(...)] [NOFLUSH]

Defaults: data is shared; consistency is copy_deps.

37 OmpSs: Recommended Syntax in FORTRAN

    !$OMP TARGET DEVICE({SMP|CUDA|OPENCL|MPI}) &
    !$OMP&   [NDRANGE(...)] [IMPLEMENTS(function_name)] &
    !$OMP&   {[COPY_DEPS]|[NO_COPY_DEPS] [COPY_IN(...)] [COPY_OUT(...)] [COPY_INOUT(...)]} &
    !$OMP&   [SHMEM(...)] [FILE(...)] [NAME(...)]
    !$OMP TASK [DEPEND(IN: ...)] [DEPEND(OUT: ...)] [DEPEND(INOUT: ...)] &
    !$OMP&   [DEPEND(CONCURRENT: ...)] [DEPEND(COMMUTATIVE: ...)] &
    !$OMP&   [PRIORITY(...)] [LABEL(...)] &
    !$OMP&   [SHARED(...)] [PRIVATE(...)] [FIRSTPRIVATE(...)] [DEFAULT(...)] &
    !$OMP&   [UNTIED] [IF(EXPRESSION)] [FINAL(EXPRESSION)]
       { code or function }
    !$OMP TASKWAIT [ON(...)] [NOFLUSH]

Defaults: data is shared; consistency is copy_deps.

38 OmpSs: syntax in FORTRAN
Pragmas are attached to the function declarations, preferably at INTERFACEs:

    INTERFACE
       !$OMP TARGET DEVICE(OPENCL) NDRANGE(1, NR, 128) FILE(krist.cl) COPY_DEPS
       !$OMP TASK IN(A, H) OUT(E)
       SUBROUTINE CSTRUCTFAC(NA, NR, NC, F2, DIM_NA, A, DIM_NH, H, DIM_NE, E)
          INTEGER :: NA, NR, NC, DIM_NA, DIM_NH, DIM_NE
          REAL :: F2, A(DIM_NA), H(DIM_NH), E(DIM_NE)
       END SUBROUTINE CSTRUCTFAC
    END INTERFACE

Invocation:

    DO II = 1, NR, NR_2
       IND_H = (DIM2_H * (II - 1)) + 1
       IND_E = (DIM2_E * (II - 1)) + 1
       CALL CSTRUCTFAC(NA, NR_2, MAXATOMS, F2, DIM_NA, A, &
                       DIM_NH/TASKS, H(IND_H : IND_H + (DIM_NH/TASKS) - 1), &
                       DIM_NE/TASKS, E(IND_E : IND_E + (DIM_NE/TASKS) - 1))
    END DO

39 FORTRAN example, assuming write-back cache policy

    SUBROUTINE INITIALIZE(N, VEC1, VEC2, RESULTS)
       IMPLICIT NONE
       INTEGER :: N
       INTEGER :: VEC1(N), VEC2(N), RESULTS(N), I
       DO I = 1, N
          VEC1(I) = I
          VEC2(I) = N+1-I
          RESULTS(I) = -1
       END DO
    END SUBROUTINE INITIALIZE

    PROGRAM P
       IMPLICIT NONE
       INTERFACE
          !$OMP TARGET DEVICE(OPENCL) NDRANGE(1, N, 128) FILE(kernel.cl) COPY_DEPS
          !$OMP TASK IN(A, B) OUT(RES)
          SUBROUTINE VEC_SUM(N, A, B, RES)
             IMPLICIT NONE
             INTEGER, VALUE :: N
             INTEGER :: A(N), B(N), RES(N)
          END SUBROUTINE VEC_SUM
       END INTERFACE
       INTEGER, PARAMETER :: N = 20
       INTEGER :: VEC1(N), VEC2(N), RESULTS(N), I

       CALL INITIALIZE(N, VEC1, VEC2, RESULTS)
       CALL VEC_SUM(N, VEC1, VEC2, RESULTS)     ! VEC1 and VEC2 sent to the GPU; no output transfer (write-back)
       CALL VEC_SUM(N, VEC1, RESULTS, RESULTS)  ! input data already in the GPU; no input/output transfer
       !$OMP TASKWAIT                           ! RESULTS copied out from the GPU
       PRINT *, "RESULTS: ", RESULTS
    END PROGRAM P

Expected IN/OUT transfers: IN = 160B, OUT = 80B.

40 FORTRAN example, assuming write-back cache policy
The same program as in the previous slide, together with the OpenCL kernel it invokes (kernel.cl):

    __kernel void vec_sum(int n, __global int *a, __global int *b, __global int *res)
    {
        const int idx = get_global_id(0);
        if (idx < n) {
            res[idx] = a[idx] + b[idx];
        }
    }

41 FORTRAN example, assuming write-back cache policy
The same program, now with a taskwait between the two calls (declarations as before):

       CALL INITIALIZE(N, VEC1, VEC2, RESULTS)
       CALL VEC_SUM(N, VEC1, VEC2, RESULTS)     ! VEC1 and VEC2 sent to the GPU; no output transfer (write-back)
       !$OMP TASKWAIT                           ! RESULTS copied out from the GPU and removed from it
       PRINT *, "PARTIAL RESULTS: ", RESULTS
       CALL VEC_SUM(N, VEC1, RESULTS, RESULTS)  ! inputs sent to the GPU again; no output transfer (write-back)
       !$OMP TASKWAIT                           ! RESULTS copied out from the GPU and removed from it
       PRINT *, "RESULTS: ", RESULTS
    END PROGRAM P

42 FORTRAN example, assuming write-back cache policy
The same program, now with taskwait noflush after the first call (declarations as before):

       CALL INITIALIZE(N, VEC1, VEC2, RESULTS)
       CALL VEC_SUM(N, VEC1, VEC2, RESULTS)     ! VEC1 and VEC2 sent to the GPU; no output transfer (write-back)
       !$OMP TASKWAIT NOFLUSH                   ! NO data copied out from the GPU
       PRINT *, "PARTIAL RESULTS: ", RESULTS    ! printed values are -1
       CALL VEC_SUM(N, VEC1, RESULTS, RESULTS)  ! all data already in the GPU; no output transfer
       !$OMP TASKWAIT                           ! RESULTS copied out from the GPU and removed from it
       PRINT *, "RESULTS: ", RESULTS
    END PROGRAM P

43 FORTRAN example, assuming write-back cache policy

    PROGRAM P
       IMPLICIT NONE
       INTERFACE
          !$OMP TARGET DEVICE(OPENCL) NDRANGE(1, N, 128) FILE(kernel.cl) COPY_DEPS
          !$OMP TASK IN(A, B) OUT(RES)
          SUBROUTINE VEC_SUM(N, A, B, RES)
             IMPLICIT NONE
             INTEGER, VALUE :: N
             INTEGER :: A(N), B(N), RES(N)
          END SUBROUTINE VEC_SUM

          !$OMP TARGET DEVICE(SMP) COPY_DEPS
          !$OMP TASK IN(BUFF)
          SUBROUTINE PRINT_BUFF(N, BUFF)
             IMPLICIT NONE
             INTEGER, VALUE :: N
             INTEGER :: BUFF(N)
          END SUBROUTINE PRINT_BUFF
       END INTERFACE
       INTEGER, PARAMETER :: N = 20
       INTEGER :: VEC1(N), VEC2(N), RESULTS(N), I

       CALL INITIALIZE(N, VEC1, VEC2, RESULTS)
       CALL VEC_SUM(N, VEC1, VEC2, RESULTS)     ! VEC1 and VEC2 sent to the GPU; no output transfer
       CALL PRINT_BUFF(N, RESULTS)              ! RESULTS copied to the CPU and kept in the GPU
       CALL VEC_SUM(N, VEC1, RESULTS, RESULTS)  ! input data already in the GPU; no input transfer
       CALL PRINT_BUFF(N, RESULTS)              ! RESULTS copied to the CPU and kept in the GPU
       !$OMP TASKWAIT                           ! RESULTS copied out from the GPU and removed from it
    END PROGRAM P

44 Conclusions
Future programming models should:
- Enable productivity and portability
- Support heterogeneous/hierarchical architectures
- Support asynchrony: global synchronization in systems with large numbers of nodes is not an answer anymore
- Be aware of data locality
OmpSs is a proposal that:
- Enables incremental parallelization of existing sequential codes
- Provides a data-flow execution model that naturally supports asynchrony
- Nicely integrates heterogeneity and hierarchy
- Supports locality scheduling
OmpSs is an active and open-source project.
