Programming with OmpSs@CUDA/OpenCL


1 Programming with OmpSs@CUDA/OpenCL
Xavier Martorell, Rosa M. Badia
Computer Sciences Research Dept., BSC
PATC Parallel Programming Workshop, July 1-2, 2015

2 Agenda
- Motivation
- Leveraging OpenCL and CUDA
- Examples
Contact:
Source code available from

3 Motivation
OpenCL/CUDA coding is complex and error-prone:
- Memory allocation
- Data copies to/from device memory
- Manual work scheduling
- Code and data management from the host
(figure: host array h mirrored by a device copy devh)

4 Motivation
Memory allocation: a double allocation is needed.
- Host memory:
    h = (float *) malloc(sizeof(*h) * dim2_h * nr);
- Device memory:
    r = cudaMalloc((void **)&devh, sizeof(*h) * nr * dim2_h);
Different data sizes due to blocking make the code confusing.

5 Motivation
Data copies to/from device memory (copy_in/copy_out):
    cudaMemcpy(devh, h, sizeof(*h) * nr * dim2_h, cudaMemcpyHostToDevice);
Compared to homogeneous programming, there are more opportunities to overwrite data by mistake.
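All of this boilerplate is what OmpSs removes. As a point of reference, the manual CUDA pattern these slides allude to looks roughly as follows (a minimal sketch; scale_kernel and scale are hypothetical names, not from the slides):

    // Minimal sketch of the manual allocate / copy-in / launch / copy-out / free pattern
    #include <cuda_runtime.h>

    __global__ void scale_kernel(float *v, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= s;
    }

    void scale(float *h, int n, float s) {
        float *devh;
        cudaMalloc((void **)&devh, sizeof(*h) * n);                  // device allocation
        cudaMemcpy(devh, h, sizeof(*h) * n, cudaMemcpyHostToDevice); // copy in
        scale_kernel<<<(n + 127) / 128, 128>>>(devh, s, n);          // manual work scheduling
        cudaMemcpy(h, devh, sizeof(*h) * n, cudaMemcpyDeviceToHost); // copy out
        cudaFree(devh);
    }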

6 Motivation
Complex code/data management from the host.

main.c:

    // Initialize device, context, and buffers...
    memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(cl_float4) * n, srcB, NULL);
    // create the kernel
    kernel = clCreateKernel(program, "dot_product", NULL);
    // set the args values
    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);
    err = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *) &memobjs[1]);
    err = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *) &memobjs[2]);
    // set work-item dimensions
    global_work_size[0] = n;
    local_work_size[0] = 1;
    // execute the kernel
    err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                                 global_work_size, local_work_size, 0, NULL, NULL);
    // read results
    err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                              n * sizeof(cl_float), dst, 0, NULL, NULL);
    ...

kernel.cl:

    __kernel void dot_product(__global const float4 *a,
                              __global const float4 *b,
                              __global float *c)
    {
        int gid = get_global_id(0);
        c[gid] = dot(a[gid], b[gid]);
    }

7 Proposal: OmpSs
- OpenMP expressiveness: tasking
- StarSs expressiveness: data directionality hints (in/out/inout), detection of dependencies at runtime, automatic data movement
- CUDA: leverage existing kernels

8 OmpSs: execution model
Thread-pool model: the OpenMP parallel construct is ignored. All threads are created on startup:
- One SMP thread executes main() ... and tasks
- P-1 SMP workers execute tasks
- One representative thread (SMP to OpenCL/CUDA) per device; experimenting with a single representative for N devices
All threads get work from a task pool. Work is labeled with its possible targets; tasks with several targets can be scheduled to different devices at the same time.

9 OmpSs: memory model
A single global address space: the runtime system takes care of the devices' local memories.
- SMP machines: no need for extra runtime support.
- Distributed/heterogeneous environments: multiple physical address spaces exist, and versions of the same data can reside in them; data consistency is ensured by the runtime system.

10 OmpSs: Task Syntax
Based on tasks with data directionality hints: in/out/inout/concurrent. The data specified is used to compute dependences and to allow concurrent execution of concurrent/commutative tasks (see the sketch after this slide).

    #pragma omp task [in(array_spec...)] [out(...)] [inout(...)] [concurrent(...)] [commutative(...)] \
                     [priority(p)] [label(...)] \
                     [shared(...)] [private(...)] [firstprivate(...)] [default(...)] \
                     [untied] [final] [if(expression)]
       {code block or function}

    #pragma omp taskwait [on(...)]

With taskwait, the master waits for its child tasks, or for specific data to become available.
Defaults: data is shared.
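As an illustration, a minimal sketch of this syntax (my example, not from the slides; init_vec and scale_vec are hypothetical helpers):

    float v[1024];

    #pragma omp task out([1024] v) label(producer)
    init_vec(v, 1024);                 // first task produces v

    #pragma omp task inout([1024] v) priority(1) label(consumer)
    scale_vec(v, 1024, 2.0f);          // runs after the producer because of inout on v

    #pragma omp taskwait               // the master waits for both tasks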

11 OmpSs: Recommended Task Syntax
The same tasking model, with the directionality hints expressed through depend clauses:

    #pragma omp task [depend(in: array_spec...)] [depend(out: ...)] [depend(inout: ...)] \
                     [depend(concurrent: ...)] [depend(commutative: ...)] \
                     [priority(p)] [label(...)] \
                     [shared(...)] [private(...)] [firstprivate(...)] [default(...)] \
                     [untied] [final] [if(expression)]
       {code block or function}

    #pragma omp taskwait [on(...)]

With taskwait, the master waits for its child tasks, or for specific data to become available.
Defaults: data is shared.
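The same two hypothetical tasks from the previous sketch, written in the recommended depend() form:

    #pragma omp task depend(out: v) label(producer)
    init_vec(v, 1024);

    #pragma omp task depend(inout: v) priority(1) label(consumer)
    scale_vec(v, 1024, 2.0f);

    #pragma omp taskwait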

12 OmpSs: Device Syntax
The target directive marks a task implementation for a device; the compiler configures the kernel using the ndrange clause.

    #pragma omp target device({smp|opencl|cuda}) \
            [implements(function_name)] \
            [copy_deps|no_copy_deps] [copy_in(array_spec,...)] [copy_out(...)] [copy_inout(...)] \
            [ndrange(dim, ...)] [shmem(...)] [file(name)] [name(name)]

implements gives support for multiple implementations of a task. The copy clauses ask the runtime to ensure that consistent data is accessible in the address space of the device.

    #pragma omp task [in(...)] [out(...)] [inout(...)] [concurrent(...)] [commutative(...)] \
            [priority(P)] [label(name)] \
            [shared(...)] [private(...)] [firstprivate(...)] [default(...)] \
            [untied] [final] [if(expression)]
       {code block or function}

    #pragma omp taskwait [on(...)] [noflush]

taskwait can wait for specific data, generated by one or several tasks; with noflush it waits for the tasks, but not necessarily for the data transfers.
Defaults: data is shared; consistency is copy_deps.

13 OmpSs: Device Syntax with the Recommended Task Syntax
The same target directive, combined with the depend form of the task directive (see the sketch after this slide):

    #pragma omp target device({smp|opencl|cuda}) \
            [implements(function_name)] \
            [copy_deps|no_copy_deps] [copy_in(array_spec,...)] [copy_out(...)] [copy_inout(...)] \
            [ndrange(dim, ...)] [shmem(...)] [file(name)] [name(name)]

    #pragma omp task [depend(in: array_spec...)] [depend(out: ...)] [depend(inout: ...)] \
            [depend(concurrent: ...)] [depend(commutative: ...)] \
            [priority(P)] [label(name)] \
            [shared(...)] [private(...)] [firstprivate(...)] [default(...)] \
            [untied] [final] [if(expression)]
       {code block or function}

    #pragma omp taskwait [on(...)] [noflush]

taskwait can wait for specific data, generated by one or several tasks; with noflush it waits for the tasks, but not necessarily for the data transfers.
Defaults: data is shared; consistency is copy_deps.
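Putting the two directives together, a minimal sketch of a kernel declared as an OmpSs task (vec_scale and vec_scale.cl are hypothetical names, not from the slides):

    #pragma omp target device(opencl) copy_deps ndrange(1, n, 128) file(vec_scale.cl)
    #pragma omp task depend(inout: [n] v)
    __kernel void vec_scale(__global float *v, float s, int n);

    // The host calls it like a regular function; the runtime moves v to the
    // device, launches the kernel, and tracks the dependence.
    vec_scale(v, 2.0f, n);
    #pragma omp taskwait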

14 Heterogeneity: the copy clauses

    #pragma omp target [clauses]
    #pragma omp task [clauses]

Some devices (opencl, cuda) have their own private physical address space, so data must be maintained consistent between the original address space of the program and the address space of the device.
- copy_deps: by default, all data referenced in in/out/inout clauses is kept consistent.
- The copy_in, copy_out, and copy_inout clauses specify any other data that needs to be maintained consistent.
- no_copy_deps is a shorthand stating that, for each in/out/inout declaration, there is no need to copy the data.
- Tasks on the program host device (smp) also have to specify directionality, to ensure consistency for arguments referenced by tasks on some other device(s).
- The default taskwait semantics ensure consistency of all the data in the original program address space.

15 Heterogeneity: the copy clauses
- copy_in: requests a consistent copy of the data to be available in the device address space when the task executes. It may require a data transfer, performed automatically by the runtime and scheduled at its convenience; the data may already be in the device. The runtime keeps track of all data replication and ensures consistency; address translation may be needed.
- copy_out: specifies that the task produces that data. The runtime uses this information to ensure data consistency; it may require a transfer some time after the task completes, again performed automatically by the runtime at its convenience.
- copy_inout: specifies that a consistent copy of the data is required on input and that a new value is produced whose consistency has to be maintained.
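For example, a sketch of explicit copy clauses combined with no_copy_deps (saxpy and saxpy.cl are hypothetical names, not from the slides). The dependences are still declared on x and y, but since no_copy_deps disables the default copies, the data to keep consistent is listed explicitly:

    #pragma omp target device(opencl) no_copy_deps \
            copy_in([n] x) copy_inout([n] y) \
            ndrange(1, n, 128) file(saxpy.cl)
    #pragma omp task in([n] x) inout([n] y)
    __kernel void saxpy(int n, float a, __global float *x, __global float *y);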

16 Heterogeneity: the OpenCL/CUDA information clauses
- ndrange: provides the configuration for the OpenCL/CUDA kernel.
    ndrange(ndim, {global|grid}_array, {local|block}_array)
    ndrange(ndim, {global|grid}_dim1, ..., {local|block}_dim1, ...)
  1 to 3 dimensions are valid. Values can be provided through 1-, 2-, or 3-element arrays (global, local), or as two lists of 1, 2, or 3 elements matching the number of dimensions. Values can be function arguments or globally accessible variables.
- file: provides the file containing the OpenCL/CUDA source code for the specific kernel: file(filename). A file can hold several kernels; the kernel is selected by the function name.
- name: names the kernel to be invoked: name(opencl-kernel-name). Useful in FORTRAN because of the capitalization of names.
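A sketch of the two ndrange forms for a 2-D kernel (stencil and stencil.cl are hypothetical names, not from the slides):

    // Form 1: per-dimension values inline: global sizes first, then local sizes
    #pragma omp target device(opencl) copy_deps \
            ndrange(2, width, height, 16, 16) file(stencil.cl)
    #pragma omp task inout([width*height] img)
    __kernel void stencil(__global float *img, int width, int height);

    // Form 2: the sizes can instead be given through 2-element arrays:
    //   int gsize[2] = { width, height }, lsize[2] = { 16, 16 };
    //   ... ndrange(2, gsize, lsize) ...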

17 matmul

    #define BLOCK_SIZE 16
    const int BL_SIZE = BLOCK_SIZE;

    #pragma omp target device(opencl) copy_deps \
            ndrange(2, NB, NB, BL_SIZE, BL_SIZE) file(muld.cl)
    #pragma omp task in([NB*NB] A, [NB*NB] B) inout([NB*NB] C)
    __kernel void Muld(__global REAL *A,   // use __global for the
                       __global REAL *B,   // copy_in/copy_out arguments
                       int wa, int wb,
                       __global REAL *C, int NB);

    void matmul(int m, int l, int n, int mDIM, int lDIM, int nDIM,
                REAL **tileA, REAL **tileB, REAL **tileC)
    {
        int i, j, k;
        for (i = 0; i < mDIM; i++)
            for (k = 0; k < lDIM; k++)
                for (j = 0; j < nDIM; j++)
                    Muld(tileA[i*lDIM+k], tileB[k*nDIM+j], NB, NB,
                         tileC[i*nDIM+j], NB);
    }

(figure: the tiled matrices, mDIM x lDIM and lDIM x nDIM tiles of NB x NB elements each)

18 matmul kernel (muld.cl)

    #include "matmul_auxiliar_header.h"  // defines BLOCK_SIZE

    // Device multiplication function: compute C = A * B
    // wa is the width of A, wb is the width of B
    __kernel void Muld(__global REAL *A, __global REAL *B, int wa, int wb,
                       __global REAL *C, int NB)
    {
        // Block index, thread index
        int bx = get_group_id(0);
        int by = get_group_id(1);
        int tx = get_local_id(0);
        int ty = get_local_id(1);

        // Indexes of the first/last sub-matrix of A processed by the block
        int aBegin = wa * BLOCK_SIZE * by;
        int aEnd   = aBegin + wa - 1;

        // Step size used to iterate through the sub-matrices of A
        int aStep = BLOCK_SIZE;
        ...

19 Julia

    int LGWSIZE = 16;

    #pragma omp target device(opencl) copy_deps \
            ndrange(2, jc.window_size[0]/4, jc.window_size[1], LGWSIZE, 1) \
            file(julia.cl)
    #pragma omp task out(framebuffer[0; jc.window_size[0] * jc.window_size[1]])
    __kernel void compute_julia(float mup0, float mup1, float mup2, float mup3,
                                __global uint32_t *framebuffer,
                                struct julia_context jc);

    for (i = 0; i < iterations; i++) {
        getcurmu(currmu, morphtimer);
        morphtimer += 0.05f;
        compute_julia(currmu[0], currmu[1], currmu[2], currmu[3],
                      framebuffer[other_frame(display_frame)], jc);
    }

20 NBody

    #pragma omp target device(opencl) no_copy_deps file(nbody.cl) \
            copy_in(d_particles[0; number_of_particles]) \
            copy_out([size] out) \
            ndrange(1, size, MAX_NUM_THREADS)
    #pragma omp task out([size] out) \
            in(d_particles[0*size; 1*size], d_particles[1*size; 2*size], \
               d_particles[2*size; 3*size], d_particles[3*size; 4*size], \
               d_particles[4*size; 5*size], d_particles[5*size; 6*size], \
               d_particles[6*size; 7*size], d_particles[7*size; 8*size])
    __kernel void calculate_force_func(int size, float time_interval,
                                       int number_of_particles,
                                       __global Particle *d_particles,
                                       __global Particle *out,
                                       int first_local, int last_local);

    for (i = 0; i < number_of_particles; i += bs) {
        calculate_force_func(bs, time_interval, number_of_particles,
                             this_particle_array, &output_array[i], i, i+bs-1);
    }

21 NBody: SMP vs OpenCL kernels

SMP version (the loop over the local particles is explicit):

    #pragma omp target device(smp) copy_deps
    #pragma omp task out([size] output) in([particles] part)
    void calculate_force_smp(int size, float time_interval, int particles,
                             Particle *part, Particle *output,
                             int first_local, int last_local)
    {
        for (size_t id = first_local; id <= last_local; id++) {
            Particle *this_particle = output + id - first_local;
            float force_x = 0.0f, force_y = 0.0f, force_z = 0.0f;
            float total_force_x = 0.0f, total_force_y = 0.0f, total_force_z = 0.0f;
            for (int i = 0; i < particles; i++) {
                if (i != id) {
                    calculate_force(part + id, part + i, &force_x, &force_y, &force_z);
                    total_force_x += force_x;
                    total_force_y += force_y;
                    total_force_z += force_z;
                }
            }
            [...]
        }
    }

OpenCL version (one work-item per particle replaces the outer loop):

    #pragma omp target device(opencl) ndrange(1, size, 128) copy_deps
    #pragma omp task out([size] output) in([particles] part)
    __kernel void calculate_force_ocl(int size, float time_interval, int particles,
                                      __global Particle *part,
                                      __global Particle *output,
                                      int first_local, int last_local)
    {
        size_t id = get_global_id(0) + first_local;
        if (id > last_local) return;
        __global Particle *this_particle = output + id - first_local;
        float force_x = 0.0f, force_y = 0.0f, force_z = 0.0f;
        float total_force_x = 0.0f, total_force_y = 0.0f, total_force_z = 0.0f;
        for (int i = 0; i < particles; i++) {
            if (i != id) {
                calculate_force(part + id, part + i, &force_x, &force_y, &force_z);
                total_force_x += force_x;
                total_force_y += force_y;
                total_force_z += force_z;
            }
        }
        [...]
    }

22 OpenCL/CUDA device specifics
The compiler generates a stub function task that invokes the kernel, using the information in the ndrange and file clauses.
There is a host thread in the runtime representing each device. The task body (OpenCL/CUDA code) is actually executed by that thread on the host; it launches the kernels:
- compiles the kernel, if necessary
- creates buffers
- sets kernel arguments
- invokes the kernel
The runtime does the memory allocation and deallocation on the device, as well as the data transfers for copy variables. Different memory allocation / copy possibilities exist.

23 Memory allocation / copy

Regular memory allocation / copies:

    p = malloc(SIZE);
    ...
    #pragma omp target device(opencl) ndrange(1, SIZE, 16)
    #pragma omp task inout([SIZE] p)
    __kernel void vec_init(__global float *p);

Device memory allocation / using mapped memory:

    p = nanos_opencl_malloc(SIZE);
    ...
    #pragma omp target device(opencl) ndrange(1, SIZE, 16)
    #pragma omp task inout([SIZE] p)
    __kernel void vec_init(__global float *p);

24 OmpSs: the implements clause
Example of alternative implementations:

scale.c:

    #pragma omp target device(smp) copy_deps
    #pragma omp task in([size] c) out([size] b)
    void scale_task(double *b, double *c, double scalar, int size)
    {
        int j;
        for (j = 0; j < size; j++)
            b[j] = scalar * c[j];
    }

    #pragma omp target device(cuda) copy_deps implements(scale_task) ndrange(1, size, 512)
    #pragma omp task in([size] c) out([size] b)
    __global__ void scale_task_cuda(double *b, double *c, double scalar, int size);

scale_cuda.cu (compiled automatically):

    extern "C" {
        __global__ void scale_task_cuda(double *b, double *c, double scalar, int size);
    }

    __global__ void scale_task_cuda(double *b, double *c, double scalar, int size)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        b[j] = scalar * c[j];
    }

25 Heterogeneity: the target directive

    #pragma omp target device(cuda) copy_deps
    #pragma omp task in([size] c) out([size] b)
    void scale_task_cuda(double *b, double *c, double scalar, int size)
    {
        // Executed on the host; the kernel is launched to the device.
        // b and c are pointers to the memory allocated by the runtime on the device.
        dim3 dimBlock, dimGrid;
        dimBlock.x = 128;
        dimBlock.y = dimBlock.z = 1;
        dimGrid.x = size/128 + 1;
        scale_kernel<<<dimGrid, dimBlock>>>(size, 1, b, c, scalar);
    }

    #pragma omp target device(smp) copy_deps
    #pragma omp task in([size] c) out([size] b)
    void scale_task(double *b, double *c, double scalar, int size)
    {
        for (int j = 0; j < size; j++)
            b[j] = scalar * c[j];
    }

    double A[1024], B[1024], C[1024];
    double D[1024], E[1024];

    main() {
        scale_task_cuda(A, B, 10.0, 1024);  // T1: A, B transferred to the device before execution
        scale_task_cuda(B, A, 0.01, 1024);  // T2: no data transfer; will execute after T1
        scale_task(C, A, 2.0, 1024);        // T3: A transferred to the host; can be done in parallel with T2
        scale_task_cuda(D, E, 5.0, 1024);   // T4: D, E transferred to the GPU; can be done at the very beginning
        scale_task_cuda(B, C, 3.0, 1024);   // T5: C transferred to the GPU; can be done when T3 finishes
        #pragma omp taskwait                // copies D, E back to the host; can access any of A, B, C, D, E
    }

26 Heterogeneity: relaxing consistency

    // No copy of a consistent instance of the input to smp;
    // definition of a new value for the output.
    #pragma omp target device(smp) no_copy_deps copy_out([size] b)
    #pragma omp task in([size] c) out([size] b)
    void scale_task_nocpin(double *b, double *c, double scalar, int size)
    {
        for (int j = 0; j < size; j++)
            b[j] = scalar * c[j];
    }

    double A[1024], B[1024], C[1024];
    double D[1024], E[1024];

    main() {
        scale_task_cuda(A, B, 1.0, 1024);    // T1: A, B transferred to the device before execution
        scale_task_cuda(B, A, 0.1, 1024);    // T2: no data transfer; will execute after T1
        scale_task_nocpin(C, A, 2.0, 1024);  // T3: A not transferred to the host; executes after T1
                                             //     but with the old value of A
        scale_task_cuda(D, E, 5.0, 1024);    // T4: D, E transferred to the GPU; can be done at the very beginning
        scale_task_cuda(C, B, 3.0, 1024);    // T5: C, updated by T3, transferred to the GPU when T3 finishes
        #pragma omp taskwait noflush         // no copy of A, B, C, D, E back to the host

        printf("%f, %f\n", D[1], E[512]);    // prints the original, unmodified values of D and E

        scale_task(E, B, 0.5, 1024);         // T6: B and E copied back to the host before execution
        scale_task_nocpin(D, C, 0.5, 1024);  // T7: no copy of C back to the host;
                                             //     new value of D defined by T7
        #pragma omp taskwait noflush         // no copy of D back to the host
    }

27 OmpSs runtime features
- Automatic and configurable overlap of data transfers (in/out/inout) with kernel executions
- Data prefetching: input data is transferred to GPU memory well before the tasks need it

28 OmpSs runtime features
Configurable data cache policy:
- Write-through: after the execution of each task, output data is updated on the host
- Write-back: host data is updated only if needed
- Nocache: the cache is disabled (not recommended; for testing only)
Configurable limit on GPU memory usage.

29 OmpSs runtime features
CUBLAS support and context management:
- No need to call cublasInit / cublasShutdown (NX_GPUCUBLASINIT)
- CUBLAS functions can also be scheduled to specific streams to enable overlapping
- The runtime takes care of the proper allocation of streams/handles

30 OmpSs Perlin Noise

    #pragma omp target device(cuda) \
            ndrange(2, img_width, BS, 128, 1) file(perlin.cu)
    #pragma omp task out(output[0 : BS*rowstride - 1])
    __global__ void cuda_perlin(pixel output[], float time, int j,
                                int rowstride, int img_width, int BS);

(figure: a 1024-row image processed in blocks of BS rows, j = 0, 1, 2, ...)

31 Mercurium compiler
- Transforms the directives into runtime calls
- Compiles natively with the NVIDIA compiler (C/C++/Fortran)
- Embeds the resulting object files with CUDA into the host files
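A typical invocation might look as follows (a sketch; mcc/mfc are the Mercurium front-ends, and the exact flags depend on the installation):

    mcc --ompss -o app main.c              # C sources with OmpSs directives
    mcc --ompss -o app main.c kernel.cu    # CUDA kernels compiled natively and embedded
    mfc --ompss -o app main.f90            # Fortran front-end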

32 Execution options (SMP & generic)

    NX_PES=<P>                           # number of SMP cores to use
    NX_SCHEDULE=<bf|dbf|wf|socket|versioning>   # scheduling policy
                                         # versioning checks performance and decides how to use devices
    NX_ARGS="..."                        # additional arguments to pass to Nanos++
    NX_ARGS="--schedule-priority"        # enabling priority support
    NX_ARGS="--schedule-smart-priority"

33 Execution options (SMP & generic)

    NX_CACHE_POLICY=<wt|wb|nocache>      # cache policy to use
    NX_INSTRUMENTATION=extrae            # enables instrumentation (Paraver traces)
    EXTRAE_CONFIG_FILE=xmlfile           # sets the XML file for Extrae to use
    NX_ARGS=" --disable-cuda --disable-opencl "   # disabling CUDA/OpenCL support

34 Execution options (OpenCL)

    NX_OPENCL_MAX_DEVICES=<N>            # number of devices (GPUs, MICs)
    NX_OPENCL_DEVICE_TYPE=<GPU|CPU>      # type of device to use
    NX_OPENCL_CACHE_POLICY=<wb|wt|nocache>
    NX_DISABLEOPENCL=<yes|no>

35 Execution options (CUDA)

    NX_GPUS=<N>                          # number of devices (GPUs)
    NX_GPUPREFETCH=<yes|no>              # enable prefetch of task data
    NX_GPUOVERLAP=<yes|no>               # enable overlapping of transfers
    NX_GPUOVERLAP_INPUTS=<yes|no>        #   and computation
    NX_GPUOVERLAP_OUTPUTS=<yes|no>
    NX_GPUMAXMEM=<M>                     # amount of memory to preallocate
    NX_GPUCUBLASINIT=yes                 # initialize CUBLAS

Check the output of $ nanox --help
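For instance, a run combining several of these variables could look like this (a sketch; ./app stands for any OmpSs binary):

    # 8 SMP cores, 2 GPUs, data prefetching, write-back cache
    NX_PES=8 NX_GPUS=2 NX_GPUPREFETCH=yes NX_CACHE_POLICY=wb ./app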

36 OmpSs: Syntax in FORTRAN

    !$OMP TARGET DEVICE({SMP|CUDA|OPENCL|MPI}) &
    !$OMP&   [NDRANGE(...)] [IMPLEMENTS(function_name)] &
    !$OMP&   {[COPY_DEPS]|[NO_COPY_DEPS] [COPY_IN(...)] [COPY_OUT(...)] [COPY_INOUT(...)]} &
    !$OMP&   [SHMEM(...)] [FILE(...)] [NAME(...)]
    !$OMP TASK [IN(...)] [OUT(...)] [INOUT(...)] [CONCURRENT(...)] [COMMUTATIVE(...)] &
    !$OMP&   [PRIORITY(...)] [LABEL(...)] &
    !$OMP&   [SHARED(...)] [PRIVATE(...)] [FIRSTPRIVATE(...)] [DEFAULT(...)] &
    !$OMP&   [UNTIED] [IF(EXPRESSION)] [FINAL(...)]
       { code or function }
    !$OMP TASKWAIT [ON(...)] [NOFLUSH]

Defaults: data is shared; consistency is copy_deps.

37 OmpSs: Recommended Syntax in FORTRAN

    !$OMP TARGET DEVICE({SMP|CUDA|OPENCL|MPI}) &
    !$OMP&   [NDRANGE(...)] [IMPLEMENTS(function_name)] &
    !$OMP&   {[COPY_DEPS]|[NO_COPY_DEPS] [COPY_IN(...)] [COPY_OUT(...)] [COPY_INOUT(...)]} &
    !$OMP&   [SHMEM(...)] [FILE(...)] [NAME(...)]
    !$OMP TASK [DEPEND(IN: ...)] [DEPEND(OUT: ...)] [DEPEND(INOUT: ...)] &
    !$OMP&   [DEPEND(CONCURRENT: ...)] [DEPEND(COMMUTATIVE: ...)] &
    !$OMP&   [PRIORITY(...)] [LABEL(...)] &
    !$OMP&   [SHARED(...)] [PRIVATE(...)] [FIRSTPRIVATE(...)] [DEFAULT(...)] &
    !$OMP&   [UNTIED] [IF(EXPRESSION)] [FINAL(EXPRESSION)]
       { code or function }
    !$OMP TASKWAIT [ON(...)] [NOFLUSH]

Defaults: data is shared; consistency is copy_deps.

38 OmpSs: syntax in FORTRAN
Pragmas are attached to the function declarations, preferably at INTERFACEs:

    INTERFACE
       !$OMP TARGET DEVICE(OPENCL) NDRANGE(1, NR, 128) FILE(krist.cl) COPY_DEPS
       !$OMP TASK IN(A, H) OUT(E)
       SUBROUTINE CSTRUCTFAC(NA, NR, NC, F2, DIM_NA, A, DIM_NH, H, DIM_NE, E)
          INTEGER :: NA, NR, NC, DIM_NA, DIM_NH, DIM_NE
          REAL :: F2, A(DIM_NA), H(DIM_NH), E(DIM_NE)
       END SUBROUTINE CSTRUCTFAC
    END INTERFACE

Invocation:

    DO II = 1, NR, NR_2
       IND_H = (DIM2_H * (II - 1)) + 1
       IND_E = (DIM2_E * (II - 1)) + 1
       CALL CSTRUCTFAC(NA, NR_2, MAXATOMS, F2, DIM_NA, A, &
                       DIM_NH/TASKS, H(IND_H : IND_H + (DIM_NH/TASKS) - 1), &
                       DIM_NE/TASKS, E(IND_E : IND_E + (DIM_NE/TASKS) - 1))
    END DO

39 FORTRAN example, assuming write-back cache policy

    SUBROUTINE INITIALIZE(N, VEC1, VEC2, RESULTS)
       IMPLICIT NONE
       INTEGER :: N
       INTEGER :: VEC1(N), VEC2(N), RESULTS(N), I
       DO I = 1, N
          VEC1(I) = I
          VEC2(I) = N+1-I
          RESULTS(I) = -1
       END DO
    END SUBROUTINE INITIALIZE

    PROGRAM P
       IMPLICIT NONE
       INTERFACE
          !$OMP TARGET DEVICE(OPENCL) NDRANGE(1, N, 128) FILE(kernel.cl) COPY_DEPS
          !$OMP TASK IN(A, B) OUT(RES)
          SUBROUTINE VEC_SUM(N, A, B, RES)
             IMPLICIT NONE
             INTEGER, VALUE :: N
             INTEGER :: A(N), B(N), RES(N)
          END SUBROUTINE VEC_SUM
       END INTERFACE
       INTEGER, PARAMETER :: N = 20
       INTEGER :: VEC1(N), VEC2(N), RESULTS(N), I

       CALL INITIALIZE(N, VEC1, VEC2, RESULTS)
       CALL VEC_SUM(N, VEC1, VEC2, RESULTS)     ! VEC1 and VEC2 sent to the GPU; no output transfer (write-back)
       CALL VEC_SUM(N, VEC1, RESULTS, RESULTS)  ! input data already in the GPU; no input/output transfer
       !$OMP TASKWAIT                           ! RESULTS copied out from the GPU
       PRINT *, "RESULTS: ", RESULTS
    END PROGRAM P

Expected IN/OUT transfers: IN = 160B, OUT = 80B.

40 FORTRAN example, assuming write-back cache policy
The same program as in the previous slide, together with the OpenCL kernel it invokes (kernel.cl):

    __kernel void vec_sum(int n, __global int *a, __global int *b, __global int *res)
    {
        const int idx = get_global_id(0);
        if (idx < n) {
            res[idx] = a[idx] + b[idx];
        }
    }

41 FORTRAN example, assuming write-back cache policy
The same program, now with a taskwait between the two calls (declarations as before):

       CALL INITIALIZE(N, VEC1, VEC2, RESULTS)
       CALL VEC_SUM(N, VEC1, VEC2, RESULTS)     ! VEC1 and VEC2 sent to the GPU; no output transfer (write-back)
       !$OMP TASKWAIT                           ! RESULTS copied out from the GPU and removed from it
       PRINT *, "PARTIAL RESULTS: ", RESULTS
       CALL VEC_SUM(N, VEC1, RESULTS, RESULTS)  ! inputs sent to the GPU again; no output transfer (write-back)
       !$OMP TASKWAIT                           ! RESULTS copied out from the GPU and removed from it
       PRINT *, "RESULTS: ", RESULTS
    END PROGRAM P

42 FORTRAN example, assuming write-back cache policy
The same program, now with taskwait noflush after the first call (declarations as before):

       CALL INITIALIZE(N, VEC1, VEC2, RESULTS)
       CALL VEC_SUM(N, VEC1, VEC2, RESULTS)     ! VEC1 and VEC2 sent to the GPU; no output transfer (write-back)
       !$OMP TASKWAIT NOFLUSH                   ! NO data copied out from the GPU
       PRINT *, "PARTIAL RESULTS: ", RESULTS    ! printed values are -1
       CALL VEC_SUM(N, VEC1, RESULTS, RESULTS)  ! all data already in the GPU; no output transfer
       !$OMP TASKWAIT                           ! RESULTS copied out from the GPU and removed from it
       PRINT *, "RESULTS: ", RESULTS
    END PROGRAM P

43 FORTRAN example, assuming write-back cache policy

    PROGRAM P
       IMPLICIT NONE
       INTERFACE
          !$OMP TARGET DEVICE(OPENCL) NDRANGE(1, N, 128) FILE(kernel.cl) COPY_DEPS
          !$OMP TASK IN(A, B) OUT(RES)
          SUBROUTINE VEC_SUM(N, A, B, RES)
             IMPLICIT NONE
             INTEGER, VALUE :: N
             INTEGER :: A(N), B(N), RES(N)
          END SUBROUTINE VEC_SUM

          !$OMP TARGET DEVICE(SMP) COPY_DEPS
          !$OMP TASK IN(BUFF)
          SUBROUTINE PRINT_BUFF(N, BUFF)
             IMPLICIT NONE
             INTEGER, VALUE :: N
             INTEGER :: BUFF(N)
          END SUBROUTINE PRINT_BUFF
       END INTERFACE
       INTEGER, PARAMETER :: N = 20
       INTEGER :: VEC1(N), VEC2(N), RESULTS(N), I

       CALL INITIALIZE(N, VEC1, VEC2, RESULTS)
       CALL VEC_SUM(N, VEC1, VEC2, RESULTS)     ! VEC1 and VEC2 sent to the GPU; no output transfer
       CALL PRINT_BUFF(N, RESULTS)              ! RESULTS copied to the CPU and kept in the GPU
       CALL VEC_SUM(N, VEC1, RESULTS, RESULTS)  ! input data already in the GPU; no input transfer
       CALL PRINT_BUFF(N, RESULTS)              ! RESULTS copied to the CPU and kept in the GPU
       !$OMP TASKWAIT                           ! RESULTS copied out from the GPU and removed from it
    END PROGRAM P

44 Conclusions
Future programming models should:
- Enable productivity and portability
- Support heterogeneous/hierarchical architectures
- Support asynchrony: global synchronization in systems with large numbers of nodes is not an answer anymore
- Be aware of data locality
OmpSs is a proposal that:
- Enables incremental parallelization of existing sequential codes
- Provides a data-flow execution model that naturally supports asynchrony
- Nicely integrates heterogeneity and hierarchy
- Supports locality scheduling
OmpSs is an active and open-source project.
