Programming with OmpSs@CUDA/OpenCL
1 Programming with OmpSs@CUDA/OpenCL. Xavier Martorell, Rosa M. Badia, Computer Sciences Research Dept., BSC. PATC Parallel Programming Workshop, July 1-2, 2015.
2 Agenda
- Motivation
- Leveraging OpenCL and CUDA
- Examples
Contact:
Source code available from:
3 Motivation
OpenCL/CUDA coding is complex and error-prone:
- Memory allocation
- Data copies to/from device memory
- Manual work scheduling
- Code and data management from the host
(Figure: host array h mirrored by device array devh.)
4 Motivation: memory allocation
A double memory allocation is needed:
- Host memory:    h = (float *) malloc(sizeof(*h) * dim2_h * nr);
- Device memory:  err = cudaMalloc((void **)&devh, sizeof(*h) * nr * dim2_h);
Different data sizes due to blocking make the code confusing.
5 Motivation: data copies to/from device memory
copy_in/copy_out:
  cudaMemcpy(devh, h, sizeof(*h) * nr * dim2_h, cudaMemcpyHostToDevice);
There are more opportunities to overwrite data than in homogeneous programming. (A consolidated sketch of this manual workflow follows.)
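Putting the two previous slides together, a minimal sketch of the manual CUDA workflow the deck criticizes (the names h, devh, nr and dim2_h follow the slides; error handling is omitted for brevity):

  #include <cuda_runtime.h>
  #include <stdlib.h>

  float *h, *devh;

  void manual_cuda_workflow(int nr, int dim2_h)
  {
      size_t bytes = sizeof(*h) * nr * dim2_h;
      h = (float *) malloc(bytes);                        /* host allocation   */
      cudaMalloc((void **)&devh, bytes);                  /* device allocation */
      /* ... initialize h ... */
      cudaMemcpy(devh, h, bytes, cudaMemcpyHostToDevice); /* copy in           */
      /* ... launch kernels on devh ... */
      cudaMemcpy(h, devh, bytes, cudaMemcpyDeviceToHost); /* copy out          */
      cudaFree(devh);
      free(h);
  }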
6 Motivation: complex code/data management from the host
Main.c:
  // Initialize device, context, and buffers...
  memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              sizeof(cl_float4) * n, srcb, NULL);
  // Create the kernel
  kernel = clCreateKernel(program, "dot_product", NULL);
  // Set the argument values
  err = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
  err = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
  err = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjs[2]);
  // Set work-item dimensions
  global_work_size[0] = n;
  local_work_size[0] = 1;
  // Execute the kernel
  err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                               global_work_size, local_work_size, 0, NULL, NULL);
  // Read results
  err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                            n * sizeof(cl_float), dst, 0, NULL, NULL);
  ...

kernel.cl:
  __kernel void dot_product(__global const float4 *a,
                            __global const float4 *b,
                            __global float4 *c)
  {
      int gid = get_global_id(0);
      c[gid] = dot(a[gid], b[gid]);
  }
7 Proposal: OmpSs
- OpenMP expressiveness: tasking
- StarSs expressiveness:
  - Data directionality hints (in/out/inout)
  - Detection of dependencies at runtime
  - Automatic data movement
- CUDA: leverage existing kernels
(A sketch of what this buys follows.)
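As a hint of what this buys, a minimal sketch of the slide-6 dot product written as an OmpSs task (the clause syntax follows slides 10-16; the added n parameter, so the clauses can size the arrays, is an assumption):

  // Annotated kernel declaration: the runtime now owns buffers, copies and launch.
  #pragma omp target device(opencl) ndrange(1, n, 1) file(kernel.cl) copy_deps
  #pragma omp task in([n] a, [n] b) out([n] c)
  __kernel void dot_product(__global const float4 *a,
                            __global const float4 *b,
                            __global float4 *c, int n);

  // The host code shrinks to a plain call plus a synchronization point.
  dot_product(a, b, c, n);
  #pragma omp taskwait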
8 OmpSs: execution model
Thread-pool model; the OpenMP parallel construct is ignored.
- All threads are created on startup:
  - One of them (SMP) executes main... and tasks
  - P-1 workers (SMP) execute tasks
  - One representative thread (SMP to OpenCL/CUDA) per device; experimenting with a single representative for N devices
- All threads get work from a task pool:
  - Work is labeled with its possible targets
  - Tasks with several targets can be scheduled to different devices at the same time
9 OmpSs: memory model
A single global address space; the runtime system takes care of the devices/local memories.
- SMP machines: no need for extra runtime support
- Distributed/heterogeneous environments:
  - Multiple physical address spaces exist
  - Versions of the same data can reside on them
  - Data consistency is ensured by the runtime system
10 OmpSs: task syntax
Based on tasks with data directionality hints (in/out/inout/concurrent). The data specified is used
- to compute task dependences, and
- to allow concurrent execution of concurrent/commutative tasks.

  #pragma omp task [in(array_spec...)] [out(...)] [inout(...)] \
          [concurrent(...)] [commutative(...)] \
          [priority(p)] [label(...)] \
          [shared(...)] [private(...)] [firstprivate(...)] [default(...)] \
          [untied] [final] [if(expression)]
  {code block or function}

  #pragma omp taskwait [on(...)]

The master waits for its child tasks or for specific data availability.
Defaults: data is shared.
11 OmpSs: recommended task syntax
The same tasking model as the previous slide, with dependences expressed through the standard depend() clauses:

  #pragma omp task [depend(in: array_spec...)] [depend(out: ...)] [depend(inout: ...)] \
          [depend(concurrent: ...)] [depend(commutative: ...)] \
          [priority(p)] [label(...)] \
          [shared(...)] [private(...)] [firstprivate(...)] [default(...)] \
          [untied] [final] [if(expression)]
  {code block or function}

  #pragma omp taskwait [on(...)]

Defaults: data is shared. A small usage sketch follows.
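A minimal sketch of these clauses in use (a hypothetical producer/consumer pair; the function names and the n parameter are assumptions):

  #pragma omp task depend(out: [n] x)
  void produce(float *x, int n);            // writes x[0..n-1]

  #pragma omp task depend(in: [n] x) depend(inout: [n] y)
  void consume(float *x, float *y, int n);  // reads x, updates y

  ...
  produce(a, N);         // T1
  consume(a, b, N);      // T2: runs after T1 (a flows out of T1 and into T2)
  #pragma omp taskwait   // the master waits for all child tasks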
12 OmpSs: device syntax
The target directive selects the task implementation for a device. The compiler configures the kernel with ndrange, and the runtime is asked to ensure that consistent data is accessible in the address space of the device.

  #pragma omp target device({smp|opencl|cuda}) \
          [implements(function_name)] \
          [copy_deps|no_copy_deps] [copy_in(array_spec,...)] [copy_out(...)] [copy_inout(...)] \
          [ndrange(dim,...)] [shmem(...)] [file(name)] [name(name)]

The implements clause gives support for multiple implementations of a task.

  #pragma omp task [in(...)] [out(...)] [inout(...)] [concurrent(...)] [commutative(...)] \
          [priority(P)] [label(name)] \
          [shared(...)] [private(...)] [firstprivate(...)] [default(...)] \
          [untied] [final] [if(expression)]
  {code block or function}

  #pragma omp taskwait [on(...)] [noflush]

Taskwait offers the possibility to wait for specific data, generated from one or several tasks, and (with noflush) to wait for the tasks but not necessarily for the data transfers.
Defaults: data is shared; consistency is copy_deps.
13 OmpSs: device syntax with the recommended syntax for tasks
Identical to the previous slide, except that the task directive uses the depend() clauses:

  #pragma omp task [depend(in: array_spec...)] [depend(out: ...)] [depend(inout: ...)] \
          [depend(concurrent: ...)] [depend(commutative: ...)] \
          [priority(P)] [label(name)] \
          [shared(...)] [private(...)] [firstprivate(...)] [default(...)] \
          [untied] [final] [if(expression)]
  {code block or function}

  #pragma omp taskwait [on(...)] [noflush]

Defaults: data is shared; consistency is copy_deps.
14 Heterogeneity: the copy clauses
#pragma omp target [clauses] / #pragma omp task [clauses]
- Some devices (opencl, cuda) have their own private physical address space: data must be maintained consistent between the original address space of the program and the address space of the device.
- copy_deps: by default, all data referenced in in/out/inout clauses is kept consistent.
- The copy_in, copy_out, and copy_inout clauses specify any other data that needs to be maintained consistent.
- no_copy_deps is a shorthand stating that, for the in/out/inout declarations, there is no need to copy the data.
- Tasks on the program host device (smp) also have to specify directionality, to ensure consistency for those arguments referenced on some other device(s).
- The default taskwait semantics ensure consistency of all the data in the original program address space.
15 Heterogeneity: the copy clauses
- copy_in: requests a consistent copy of the data to be available in the device address space when the task is executed.
  - It may require a data transfer, automatically performed by the runtime and scheduled at its convenience; the data may already be in the device.
  - The runtime keeps track of all data replication and ensures consistency; address translation may be needed.
- copy_out: specifies that the task produces that data.
  - The runtime uses this information to ensure data consistency; it may require a transfer some time after the task completes, again performed automatically by the runtime at its convenience.
- copy_inout: specifies that a consistent copy of the data is required on input, and that a new value is produced for which consistency has to be maintained.
(A small sketch follows.)
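A small sketch combining these clauses: a hypothetical task that also reads a lookup table not named in the dependence clauses, hence the explicit copy_in (all names and sizes here are assumptions):

  // table is read by the kernel but is not a dependence; copy_in keeps it consistent.
  #pragma omp target device(cuda) copy_deps copy_in([256] table) ndrange(1, n, 128)
  #pragma omp task in([n] src) out([n] dst)
  __global__ void remap(float *dst, float *src, int *table, int n);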
16 Heterogeneity: the OpenCL/CUDA information clauses
- ndrange: provides the configuration for the OpenCL/CUDA kernel.
    ndrange(ndim, {global|grid}_array, {local|block}_array)
    ndrange(ndim, {global|grid}_dim1, ..., {local|block}_dim1, ...)
  - 1 to 3 dimensions are valid
  - values can be provided through 1-, 2-, or 3-element arrays (global, local), or as two lists of 1, 2, or 3 elements matching the number of dimensions
  - values can be function arguments or globally accessible variables
- file: provides the file containing the OpenCL/CUDA source code for the specific kernel.
    file(filename)
  - A file can hold several kernels; the kernel is selected by the function name.
- name: names the kernel to be invoked.
    name(opencl-kernel-name)
  - Useful in FORTRAN, because of the capitalization of names.
(The two ndrange spellings are sketched below.)
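For illustration, the two spellings of a 2D configuration (kernel name, file name and sizes are assumptions):

  // List form: global sizes first, then local sizes, one pair per dimension.
  #pragma omp target device(opencl) ndrange(2, width, height, 16, 16) file(kernels.cl) copy_deps
  #pragma omp task inout([width*height] img)
  __kernel void blur(__global float *img, int width, int height);

  // Array form: the same configuration via 2-element arrays; per this slide, the
  // arrays must be function arguments or globally accessible variables:
  //   int global_sz[2] = { width, height };
  //   int local_sz[2]  = { 16, 16 };
  //   ... ndrange(2, global_sz, local_sz) ...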
17 matmul
  #define BLOCK_SIZE 16
  __constant int BL_SIZE = BLOCK_SIZE;

  // Use __global for copy_in/copy_out arguments.
  #pragma omp target device(opencl) copy_deps \
          ndrange(2, NB, NB, BL_SIZE, BL_SIZE) file(muld.cl)
  #pragma omp task in([NB*NB] A, [NB*NB] B) inout([NB*NB] C)
  __kernel void Muld(__global REAL *A, __global REAL *B, int wA, int wB,
                     __global REAL *C, int NB);

  void matmul(int m, int l, int n, int mDIM, int lDIM, int nDIM,
              REAL **tileA, REAL **tileB, REAL **tileC)
  {
      int i, j, k;
      for (i = 0; i < mDIM; i++)
          for (k = 0; k < lDIM; k++)
              for (j = 0; j < nDIM; j++)
                  Muld(tileA[i*lDIM + k], tileB[k*nDIM + j], NB, NB,
                       tileC[i*nDIM + j], NB);
  }
18 matmul kernel (muld.cl)
  #include "matmul_auxiliar_header.h"   // defines BLOCK_SIZE

  // Device multiplication function: compute C = A * B
  // wA is the width of A; wB is the width of B
  __kernel void Muld(__global REAL *A, __global REAL *B, int wA, int wB,
                     __global REAL *C, int NB)
  {
      // Block index, thread index
      int bx = get_group_id(0);  int by = get_group_id(1);
      int tx = get_local_id(0);  int ty = get_local_id(1);

      // Indices of the first/last sub-matrix of A processed by the block
      int aBegin = wA * BLOCK_SIZE * by;
      int aEnd   = aBegin + wA - 1;

      // Step size used to iterate through the sub-matrices of A
      int aStep = BLOCK_SIZE;
      ...
19 Julia
  int LGWSIZE = 16;

  #pragma omp target device(opencl) copy_deps \
          ndrange(2, jc.window_size[0]/4, jc.window_size[1], LGWSIZE, 1) file(julia.cl)
  #pragma omp task out(framebuffer[0; jc.window_size[0] * jc.window_size[1]])
  __kernel void compute_julia(float mup0, float mup1, float mup2, float mup3,
                              __global uint32_t *framebuffer, struct julia_context jc);

  for (i = 0; i < iterations; i++) {
      getcurmu(currmu, morphtimer);
      morphtimer += 0.05f;
      compute_julia(currmu[0], currmu[1], currmu[2], currmu[3],
                    framebuffer[other_frame(display_frame)], jc);
  }
20 NBody
  #pragma omp target device(opencl) no_copy_deps file(nbody.cl) \
          copy_in(d_particles[0; number_of_particles]) \
          copy_out([size] out) \
          ndrange(1, size, MAX_NUM_THREADS)
  #pragma omp task out([size] out) \
          in(d_particles[0*size; 1*size], d_particles[1*size; 2*size], \
             d_particles[2*size; 3*size], d_particles[3*size; 4*size], \
             d_particles[4*size; 5*size], d_particles[5*size; 6*size], \
             d_particles[6*size; 7*size], d_particles[7*size; 8*size])
  __kernel void calculate_force_func(int size, float time_interval,
                                     int number_of_particles,
                                     __global Particle *d_particles,
                                     __global Particle *out,
                                     int first_local, int last_local);

  for (i = 0; i < number_of_particles; i += bs) {
      calculate_force_func(bs, time_interval, number_of_particles,
                           this_particle_array, &output_array[i], i, i + bs - 1);
  }
21 NBody: SMP vs OpenCL kernels
SMP version:
  #pragma omp target device(smp) copy_deps
  #pragma omp task out([size] output) in([particles] d_particles)
  void calculate_force_smp(int size, float time_interval, int particles,
                           Particle *d_particles, Particle *output,
                           int first_local, int last_local)
  {
      for (size_t id = first_local; id <= last_local; id++) {
          Particle *this_particle = output + id - first_local;
          float force_x = 0.0f, force_y = 0.0f, force_z = 0.0f;
          float total_force_x = 0.0f, total_force_y = 0.0f, total_force_z = 0.0f;
          for (int i = 0; i < particles; i++) {
              if (i != id) {
                  calculate_force(d_particles + id, d_particles + i,
                                  &force_x, &force_y, &force_z);
                  total_force_x += force_x;
                  total_force_y += force_y;
                  total_force_z += force_z;
              }
          }
          [...]
      }
  }

OpenCL version:
  #pragma omp target device(opencl) ndrange(1, size, 128) copy_deps
  #pragma omp task out([size] output) in([particles] d_particles)
  __kernel void calculate_force_ocl(int size, float time_interval, int particles,
                                    __global Particle *d_particles,
                                    __global Particle *output,
                                    int first_local, int last_local)
  {
      size_t id = get_global_id(0) + first_local;
      if (id > last_local) return;
      __global Particle *this_particle = output + id - first_local;
      float force_x = 0.0f, force_y = 0.0f, force_z = 0.0f;
      float total_force_x = 0.0f, total_force_y = 0.0f, total_force_z = 0.0f;
      for (int i = 0; i < particles; i++) {
          if (i != id) {
              calculate_force(d_particles + id, d_particles + i,
                              &force_x, &force_y, &force_z);
              total_force_x += force_x;
              total_force_y += force_y;
              total_force_z += force_z;
          }
      }
      [...]
  }
22 OpenCL/CUDA device specifics
- The compiler generates a stub function task that invokes the kernel, using the information in the ndrange and file clauses.
- A host thread in the runtime represents each device; the task body of OpenCL/CUDA code is actually executed by that thread on the host. That thread launches the kernels:
  - compiles the kernel, if necessary
  - creates buffers
  - sets the kernel arguments
  - invokes the kernel
- The runtime does the memory allocation and deallocation on the device, as well as the data transfers for copy variables.
- Different memory allocation/copy possibilities exist; a conceptual sketch of the stub follows.
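Conceptually, the generated stub performs the same OpenCL calls a programmer would otherwise write by hand (a sketch only; the actual Nanos++ stub and its names differ):

  #include <CL/cl.h>

  // Hypothetical shape of a compiler-generated stub for a 1D OpenCL task kernel.
  void stub_vec_init(cl_command_queue q, cl_kernel k, cl_mem buf_p, size_t n)
  {
      size_t global = n, local = 16;                    // from ndrange(1, n, 16)
      clSetKernelArg(k, 0, sizeof(cl_mem), &buf_p);     // runtime-managed buffer
      clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);
  }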
23 Memory allocation / copy
Regular memory allocation / copies:
  p = malloc(SIZE);
  ...
  #pragma omp target device(opencl) ndrange(1, SIZE, 16)
  #pragma omp task inout([SIZE] p)
  __kernel void vec_init(__global float *p);

Device memory allocation / using mapped memory:
  p = nanos_opencl_malloc(SIZE);
  ...
  #pragma omp target device(opencl) ndrange(1, SIZE, 16)
  #pragma omp task inout([SIZE] p)
  __kernel void vec_init(__global float *p);
24 OmpSs: the implements clause
Example of alternative implementations.

scale.c:
  #pragma omp target device(smp) copy_deps
  #pragma omp task input([size] c) output([size] b)
  void scale_task(double *b, double *c, double scalar, int size)
  {
      int j;
      for (j = 0; j < size; j++)
          b[j] = scalar * c[j];
  }

  #pragma omp target device(cuda) copy_deps implements(scale_task) ndrange(1, size, 512)
  #pragma omp task input([size] c) output([size] b)
  __global__ void scale_task_cuda(double *b, double *c, double scalar, int size);

scale_cuda.cu (compiled automatically):
  extern "C" {
      __global__ void scale_task_cuda(double *b, double *c, double scalar, int size);
  }

  __global__ void scale_task_cuda(double *b, double *c, double scalar, int size)
  {
      int j = blockIdx.x * blockDim.x + threadIdx.x;
      b[j] = scalar * c[j];
  }
25 Heterogeneity: the target directive
  // The task body executes on the host; the kernel is launched to the device.
  // b and c are pointers to the memory allocated by the runtime on the device.
  #pragma omp target device(cuda) copy_deps
  #pragma omp task input([size] c) output([size] b)
  void scale_task_cuda(double *b, double *c, double scalar, int size)
  {
      dim3 dimBlock, dimGrid;
      dimBlock.x = 128;
      dimBlock.y = dimBlock.z = 1;
      dimGrid.x = size/128 + 1;
      scale_kernel<<<dimGrid, dimBlock>>>(size, 1, b, c, scalar);
  }

  #pragma omp target device(smp) copy_deps
  #pragma omp task input([size] c) output([size] b)
  void scale_task(double *b, double *c, double scalar, int size)
  {
      for (int j = 0; j < size; j++)
          b[j] = scalar * c[j];
  }

  double A[1024], B[1024], C[1024];
  double D[1024], E[1024];

  main() {
      scale_task_cuda(A, B, 10.0, 1024);  // T1: A, B have to be transferred to the device before execution
      scale_task_cuda(B, A, 0.01, 1024);  // T2: no data transfer; will execute after T1
      scale_task(C, A, 2.0, 1024);        // T3: A has to be transferred to the host; can be done in parallel with T2
      scale_task_cuda(D, E, 5.0, 1024);   // T4: D, E have to be transferred to the GPU; can be done at the very beginning
      scale_task_cuda(B, C, 3.0, 1024);   // T5: C has to be transferred to the GPU; can be done when T3 finishes
      #pragma omp taskwait                // copy D, E back to the host
      // can access any of A, B, C, D, E
  }
26 Heterogeneity: relaxing consistency
  // No copy of a consistent instance of the input to smp;
  // a new value is defined for the output.
  #pragma omp target device(smp) no_copy_deps copy_out([size] b)
  #pragma omp task input([size] c) output([size] b)
  void scale_task_nocpin(double *b, double *c, double scalar, int size)
  {
      for (int j = 0; j < size; j++)
          b[j] = scalar * c[j];
  }

  double A[1024], B[1024], C[1024];
  double D[1024], E[1024];

  main() {
      scale_task_cuda(A, B, 1.0, 1024);    // T1: A, B have to be transferred to the device before execution
      scale_task_cuda(B, A, 0.1, 1024);    // T2: no data transfer; will execute after T1
      scale_task_nocpin(C, A, 2.0, 1024);  // T3: A is not transferred to the host; T3 executes after T1,
                                           //     but with the old value of A
      scale_task_cuda(D, E, 5.0, 1024);    // T4: D, E have to be transferred to the GPU; can be done at the very beginning
      scale_task_cuda(C, B, 3.0, 1024);    // T5: the C value updated by T3 has to be transferred to the GPU;
                                           //     can be done when T3 finishes
      #pragma omp taskwait noflush         // no copy of A, B, C, D, E back to the host

      printf("%f, %f\n", D[1], E[512]);    // prints the original, unmodified values of D and E

      scale_task(E, B, 0.5, 1024);         // T6: B and E copied back to the host before execution
      scale_task_nocpin(D, C, 0.5, 1024);  // T7: no copy of C back to the host
      #pragma omp taskwait noflush         // no copy of D back to the host, but its new value is defined by T7
  }
27 OmpSs runtime features
- Automatic and configurable overlapping of data transfers and computation: in/out/inout transfers overlap with the kernel executions
- Data prefetching: input data is transferred to GPU memory well before the tasks need it
28 OmpSs runtime features
Configurable data cache policy:
- Write-through: after the execution of each task, output data is updated on the host
- Write-back: host data is updated only when needed
- Nocache: the cache is disabled; not recommended, for testing only
Configurable limit on GPU memory usage.
29 OmpSs runtime features
CUBLAS support and context management:
- No need to call cublasInit / cublasShutdown (NX_GPUCUBLASINIT)
- CUBLAS functions can also be scheduled to specific streams to enable overlapping
- The runtime takes care of the proper allocation of streams/handles
(A hedged sketch follows.)
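As a sketch only (assuming NX_GPUCUBLASINIT=yes so the runtime has initialized CUBLAS; the task shape is an assumption, not taken from the slides), a CUDA task body could call the legacy CUBLAS API directly, since b and c style arguments arrive as device pointers (slide 25):

  #include <cublas.h>   // legacy CUBLAS API, initialized by the runtime

  #pragma omp target device(cuda) copy_deps
  #pragma omp task in([n*n] a, [n*n] b) inout([n*n] c)
  void dgemm_task(double *a, double *b, double *c, int n)
  {
      // a, b, c are device pointers here; the runtime handled the transfers.
      cublasDgemm('N', 'N', n, n, n, 1.0, a, n, b, n, 1.0, c, n);
  }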
30 OmpSs Perlin Noise
(Figure: a 1024-wide image computed in row blocks j = 0, 1, ...)

  #pragma omp target device(cuda) ndrange(2, img_width, BS, 128, 1) file(perlin.cu)
  #pragma omp task out(output[0 : BS*rowstride - 1])
  __global__ void cuda_perlin(pixel output[], float time, int j,
                              int rowstride, int img_width, int BS);
31 Mercurium compiler
- Transforming: directives to runtime calls
- Compiling: natively with the Nvidia compiler; C/C++/Fortran
- Embedding: object files with CUDA embedded in host files
(A sample invocation is sketched below.)
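For illustration, a typical build might look as follows (the Mercurium drivers mcc/mfc and the --ompss flag are the standard OmpSs toolchain; the file names and the exact command lines are assumptions):

  # C program with a CUDA kernel file, built with the OmpSs C driver
  mcc --ompss main.c scale_cuda.cu -o scale

  # Fortran program whose OpenCL kernel lives in kernel.cl
  mfc --ompss program.f90 kernel.cl -o program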
32 Execution options (SMP & generic)
  NX_PES=<P>                                  # number of SMP cores to use
  NX_SCHEDULE=<bf|dbf|wf|socket|versioning>   # scheduling policy; versioning checks
                                              #   performance and decides how to use devices
  NX_ARGS="..."                               # additional arguments to pass to Nanos++
  NX_ARGS="--schedule-priority"               # enabling priority support
  NX_ARGS="--schedule-smart-priority"
33 Execution options (SMP & generic)
  NX_CACHE_POLICY=<wt|wb|nocache>               # cache policy to use
  NX_INSTRUMENTATION=extrae                     # enables instrumentation (Paraver traces)
  EXTRAE_CONFIG_FILE=xmlfile                    # sets the XML file for Extrae to use
  NX_ARGS=" --disable-cuda --disable-opencl "   # disabling CUDA/OpenCL support
34 Execution options (OpenCL)
  NX_OPENCL_MAX_DEVICES=<N>                # number of devices (GPUs, MICs)
  NX_OPENCL_DEVICE_TYPE=<GPU|CPU>          # type of device to use
  NX_OPENCL_CACHE_POLICY=<wb|wt|nocache>
  NX_DISABLEOPENCL=<yes|no>
35 Execution options (CUDA)
  NX_GPUS=<N>                     # number of devices (GPUs)
  NX_GPUPREFETCH=<yes|no>         # enable prefetch of task data
  NX_GPUOVERLAP=<yes|no>          # enable overlapping of transfers
  NX_GPUOVERLAP_INPUTS=<yes|no>   #   and computation
  NX_GPUOVERLAP_OUTPUTS=<yes|no>
  NX_GPUMAXMEM=<M>                # amount of memory to preallocate
  NX_GPUCUBLASINIT=yes            # initialize CUBLAS
Check the output of: $ nanox --help
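Putting a few of these together, a run might be launched like this (the application name and the chosen values are assumptions):

  # Two GPUs, write-back cache, prefetch and overlap enabled, Extrae tracing on
  NX_GPUS=2 NX_CACHE_POLICY=wb NX_GPUPREFETCH=yes NX_GPUOVERLAP=yes \
  NX_INSTRUMENTATION=extrae ./matmul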
36 OmpSs: syntax in FORTRAN
  !$OMP TARGET DEVICE({SMP|CUDA|OPENCL|MPI}) &
  !$OMP&  [NDRANGE(...)] [IMPLEMENTS(function_name)] &
  !$OMP&  {[COPY_DEPS]|[NO_COPY_DEPS]} [COPY_IN(...)] [COPY_OUT(...)] [COPY_INOUT(...)] &
  !$OMP&  [SHMEM(...)] [FILE(...)] [NAME(...)]
  !$OMP TASK [IN(...)] [OUT(...)] [INOUT(...)] [CONCURRENT(...)] [COMMUTATIVE(...)] &
  !$OMP&  [PRIORITY(...)] [LABEL(...)] &
  !$OMP&  [SHARED(...)] [PRIVATE(...)] [FIRSTPRIVATE(...)] [DEFAULT(...)] &
  !$OMP&  [UNTIED] [IF(EXPRESSION)] [FINAL(...)]
  { code or function }
  !$OMP TASKWAIT [ON(...)] [NOFLUSH]
Defaults: data is shared; consistency is copy_deps.
37 OmpSs: recommended syntax in FORTRAN
  !$OMP TARGET DEVICE({SMP|CUDA|OPENCL|MPI}) &
  !$OMP&  [NDRANGE(...)] [IMPLEMENTS(function_name)] &
  !$OMP&  {[COPY_DEPS]|[NO_COPY_DEPS]} [COPY_IN(...)] [COPY_OUT(...)] [COPY_INOUT(...)] &
  !$OMP&  [SHMEM(...)] [FILE(...)] [NAME(...)]
  !$OMP TASK [DEPEND(IN: ...)] [DEPEND(OUT: ...)] [DEPEND(INOUT: ...)] &
  !$OMP&  [DEPEND(CONCURRENT: ...)] [DEPEND(COMMUTATIVE: ...)] &
  !$OMP&  [PRIORITY(...)] [LABEL(...)] &
  !$OMP&  [SHARED(...)] [PRIVATE(...)] [FIRSTPRIVATE(...)] [DEFAULT(...)] &
  !$OMP&  [UNTIED] [IF(EXPRESSION)] [FINAL(EXPRESSION)]
  { code or function }
  !$OMP TASKWAIT [ON(...)] [NOFLUSH]
Defaults: data is shared; consistency is copy_deps.
38 OmpSs: syntax in FORTRAN
Pragmas are attached to the function declarations, preferably at INTERFACEs:

  INTERFACE
  !$OMP TARGET DEVICE(OPENCL) NDRANGE(1, NR, 128) FILE(krist.cl) COPY_DEPS
  !$OMP TASK IN(A, H) OUT(E)
     SUBROUTINE CSTRUCTFAC(NA, NR, NC, F2, DIM_NA, A, DIM_NH, H, DIM_NE, E)
        INTEGER :: NA, NR, NC, DIM_NA, DIM_NH, DIM_NE
        REAL :: F2, A(DIM_NA), H(DIM_NH), E(DIM_NE)
     END SUBROUTINE CSTRUCTFAC
  END INTERFACE

Invocation:

  DO II = 1, NR, NR_2
     IND_H = (DIM2_H * (II - 1)) + 1
     IND_E = (DIM2_E * (II - 1)) + 1
     CALL CSTRUCTFAC(NA, NR_2, MAXATOMS, F2, DIM_NA, A,                   &
                     DIM_NH/TASKS, H(IND_H : IND_H + (DIM_NH/TASKS) - 1), &
                     DIM_NE/TASKS, E(IND_E : IND_E + (DIM_NE/TASKS) - 1))
  END DO
39 FORTRAN example, assuming write-back cache policy
  SUBROUTINE INITIALIZE(N, VEC1, VEC2, RESULTS)
     IMPLICIT NONE
     INTEGER :: N
     INTEGER :: VEC1(N), VEC2(N), RESULTS(N), I
     DO I = 1, N
        VEC1(I) = I
        VEC2(I) = N + 1 - I
        RESULTS(I) = -1
     END DO
  END SUBROUTINE INITIALIZE

  PROGRAM P
     IMPLICIT NONE
     INTERFACE
  !$OMP TARGET DEVICE(OPENCL) NDRANGE(1, N, 128) FILE(kernel.cl) COPY_DEPS
  !$OMP TASK IN(A, B) OUT(RES)
        SUBROUTINE VEC_SUM(N, A, B, RES)
           IMPLICIT NONE
           INTEGER, VALUE :: N
           INTEGER :: A(N), B(N), RES(N)
        END SUBROUTINE VEC_SUM
     END INTERFACE
     INTEGER, PARAMETER :: N = 20
     INTEGER :: VEC1(N), VEC2(N), RESULTS(N), I

     CALL INITIALIZE(N, VEC1, VEC2, RESULTS)
     CALL VEC_SUM(N, VEC1, VEC2, RESULTS)     ! VEC1 and VEC2 are sent to the GPU;
                                              ! no output transfer (write-back)
     CALL VEC_SUM(N, VEC1, RESULTS, RESULTS)  ! input data is already in the GPU;
                                              ! no input/output transfer
  !$OMP TASKWAIT                              ! RESULTS copied out from the GPU
     PRINT *, "RESULTS: ", RESULTS
  END PROGRAM P

Expected IN/OUT transfers: IN = 160 B, OUT = 80 B.
40 FORTRAN example, assuming write-back cache policy
The same example as the previous slide, now also showing kernel.cl:

  __kernel void vec_sum(int n, __global int *a, __global int *b, __global int *res)
  {
      const int idx = get_global_id(0);
      if (idx < n)
      {
          res[idx] = a[idx] + b[idx];
      }
  }
41 FORTRAN example, assuming write-back cache policy
The same program (INITIALIZE subroutine and interface as in slide 39), now with a TASKWAIT between the two calls:

     CALL INITIALIZE(N, VEC1, VEC2, RESULTS)
     CALL VEC_SUM(N, VEC1, VEC2, RESULTS)     ! VEC1 and VEC2 are sent to the GPU;
                                              ! no output transfer (write-back)
  !$OMP TASKWAIT                              ! RESULTS copied out from the GPU and removed
     PRINT *, "PARTIAL RESULTS: ", RESULTS
     CALL VEC_SUM(N, VEC1, RESULTS, RESULTS)  ! the inputs are sent to the GPU again;
                                              ! no output transfer (write-back)
  !$OMP TASKWAIT                              ! RESULTS copied out from the GPU and removed
     PRINT *, "RESULTS: ", RESULTS
  END PROGRAM P
42 FORTRAN example, assuming write-back cache policy
The same program, now with TASKWAIT NOFLUSH after the first call:

     CALL INITIALIZE(N, VEC1, VEC2, RESULTS)
     CALL VEC_SUM(N, VEC1, VEC2, RESULTS)     ! VEC1 and VEC2 are sent to the GPU;
                                              ! no output transfer (write-back)
  !$OMP TASKWAIT NOFLUSH                      ! NO data copied out from the GPU
     PRINT *, "PARTIAL RESULTS: ", RESULTS    ! printed values are -1
     CALL VEC_SUM(N, VEC1, RESULTS, RESULTS)  ! all data is already in the GPU;
                                              ! no output transfer
  !$OMP TASKWAIT                              ! RESULTS copied out from the GPU and removed
     PRINT *, "RESULTS: ", RESULTS
  END PROGRAM P
43 FORTRAN example, assuming write-back cache policy
The same program, now using an SMP task to print intermediate results:

  PROGRAM P
     IMPLICIT NONE
     INTERFACE
  !$OMP TARGET DEVICE(OPENCL) NDRANGE(1, N, 128) FILE(kernel.cl) COPY_DEPS
  !$OMP TASK IN(A, B) OUT(RES)
        SUBROUTINE VEC_SUM(N, A, B, RES)
           IMPLICIT NONE
           INTEGER, VALUE :: N
           INTEGER :: A(N), B(N), RES(N)
        END SUBROUTINE VEC_SUM
  !$OMP TARGET DEVICE(SMP) COPY_DEPS
  !$OMP TASK IN(BUFF)
        SUBROUTINE PRINT_BUFF(N, BUFF)
           IMPLICIT NONE
           INTEGER, VALUE :: N
           INTEGER :: BUFF(N)
        END SUBROUTINE PRINT_BUFF
     END INTERFACE
     INTEGER, PARAMETER :: N = 20
     INTEGER :: VEC1(N), VEC2(N), RESULTS(N), I

     CALL INITIALIZE(N, VEC1, VEC2, RESULTS)
     CALL VEC_SUM(N, VEC1, VEC2, RESULTS)     ! VEC1 and VEC2 are sent to the GPU;
                                              ! no output transfer
     CALL PRINT_BUFF(N, RESULTS)              ! RESULTS copied to the CPU and kept in the GPU
     CALL VEC_SUM(N, VEC1, RESULTS, RESULTS)  ! input data is already in the GPU;
                                              ! no input transfer
     CALL PRINT_BUFF(N, RESULTS)              ! RESULTS copied to the CPU and kept in the GPU
  !$OMP TASKWAIT                              ! RESULTS copied out from the GPU and removed from there
  END PROGRAM P
44 Conclusions
Future programming models should:
- enable productivity and portability
- support heterogeneous/hierarchical architectures
- support asynchrony: global synchronization in systems with a large number of nodes is not an answer anymore
- be aware of data locality
OmpSs is a proposal that enables:
- incremental parallelization from existing sequential codes
- a data-flow execution model that naturally supports asynchrony
- nice integration of heterogeneity and hierarchy
- support for locality scheduling
Active and open source project: