Hybrid OmpSs: OmpSs, MPI and GPUs (Guillermo Miranda)


1 Hybrid OmpSs: OmpSs, MPI and GPUs
Guillermo Miranda
Trondheim, 3 October 2013

2 Outline
- OmpSs Programming Model
  - StarSs
  - Comparison with OpenMP
  - Heterogeneity (OpenCL and CUDA support)
- Hybrid MPI+OmpSs
  - Why? Advantages
  - Communication overlapping
- OpenCL Integration in OmpSs
  - Directives
  - Fortran support
  - Performance and scalability
Contact: ...
Source code available from ...

3 OmpSs Programming Model: StarSs
- StarSs: a family of task-based programming models: SMPSs, CellSs, GPUSs, GridSs
- Basic concept: write sequential code on a flat, single address space and add directionality annotations
  - Data-dependency synchronization
  - Asynchronous calls
- OmpSs combines the StarSs features into a single programming model

4 OmpSs Programming Model: Comparison with OpenMP 3.0 (I)
- Dataflow is specified using data dependencies
- No need for parallel regions; tasks are created straight from sequential code (see the sketch below)
- Explicit synchronization point if required (taskwait)
- Different task scheduling policies:
  - Global or local task queues (wf, bf)
  - Data locality (NUMA, GPU, cluster)
- Multiple implementations of the same task (versioning)
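As a minimal sketch of this difference (compute and use are hypothetical helpers, not from the slides), an OmpSs program creates dependent tasks directly from main, with no enclosing parallel region:

    int compute(void);   /* hypothetical producer */
    void use(int v);     /* hypothetical consumer */

    int main(void)
    {
        int x = 0;
        #pragma omp task out(x)   /* producer task */
        x = compute();
        #pragma omp task in(x)    /* consumer: runs only after the producer */
        use(x);
        #pragma omp taskwait      /* explicit synchronization point */
        return 0;
    }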

5 OmpSs Programming Model: Comparison with OpenMP 3.0 (II)
- CUDA and OpenCL support
- Copy clauses allow working with multiple address spaces:
  - The programmer has the perception of a single address space
  - The runtime library keeps consistency between memories
- The compiler tool-chain allows working with heterogeneous code:
  - Multiple device architectures
  - Different implementations of the same task, on the same or on different devices (sketched below)
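OmpSs exposes this versioning through the implements clause; the sketch below is illustrative (the kernel body and names are assumptions, not from the slides) and registers a CUDA variant of a saxpy task so the runtime can choose either implementation at schedule time:

    #pragma omp task in([n]x) inout([n]y)
    void saxpy(int n, float a, float* x, float* y);       /* SMP version */

    #pragma omp target device(cuda) ndrange(1, n, 128) copy_deps implements(saxpy)
    #pragma omp task in([n]x) inout([n]y)
    __global__ void saxpy_cuda(int n, float a, float* x, float* y);  /* CUDA version */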

6 OmpSs Programming Model: Comparison with OpenMP 4.0
- Dependencies:
  - Similar syntax
  - OpenMP does not support dependencies on strided (non-contiguous) memory regions; see the sketch below
- Target directive:
  - The OpenMP target has a different meaning
  - In OpenMP, the device compiler will be called to generate code
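As a hedged illustration (update_block is a hypothetical helper), an OmpSs region dependence can name a sub-block of a matrix, a region that is strided in memory, which OpenMP 4.0 depend clauses cannot express:

    /* BS x BS sub-block of C: non-contiguous in memory */
    #pragma omp task inout(C[i:i+BS-1][j:j+BS-1])
    update_block(&C[i][j], BS, N);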

7 OmpSs Programming Model: Execution model
- A dataflow programming model that works well with:
  - Large and complex legacy applications
    - Incremental parallelization methodology
    - The programmer provides local information; the runtime manages the global application
  - New or legacy applications that require acceleration
    - Nice support for OpenCL/CUDA kernels
    - View of a single memory address space

8 OmpSs Programming Model

    void foo ( int *a, int *b )
    {
        for ( i = 1; i < N; i++ ) {
            #pragma omp task in(a[i-1]) inout(a[i]) out(b[i])
            propagate(&a[i-1], &a[i], &b[i]);

            #pragma omp task in(b[i-1]) inout(b[i])
            correct(&b[i-1], &b[i]);
        }
        #pragma omp taskwait
    }

9 OmpSs Programming Model: Tool-chain
- Mercurium: source-to-source compiler
  - Supports Fortran, C and C++
- Nanos++: common execution runtime (C, C++ and Fortran)
  - Task creation, dependency management, task scheduling...
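For reference, typical Mercurium driver invocations look like the following sketch (driver names and flags as commonly documented for the toolchain; check your installation):

    mcc  --ompss -o app app.c      # C front-end
    mcxx --ompss -o app app.cpp    # C++ front-end
    mfc  --ompss -o app app.f90    # Fortran front-end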

10 Hybrid MPI+OmpSs

11 Hybrid MPI+OmpSs: Why?
- MPI is here to stay:
  - A lot of HPC applications are already written in MPI
  - MPI scales well to tens or hundreds of thousands of nodes
- MPI exploits inter-node parallelism, while OmpSs exploits intra-node parallelism
- Communication and computation overlap through the inclusion of communication in the task graph
- MPI/OmpSs overcomes the overly synchronous structure of MPI/OpenMP, propagating the asynchronous dataflow behavior to the MPI level

12 Hybrid MPI+OmpSs: Performance
- Higher performance at smaller problem sizes
- Improved load balance (fewer processes)
- Higher IPC
- Overlap of communication and computation
- Tolerance to bandwidth limitations and OS noise
[Charts: performance (higher = better) and execution time (lower = better)]

13 Hybrid MPI+OmpSs: Matrix multiplication (MPI)

    for ( i = 0; i < processes; i++ ) {
        stag  = i + 1;
        rtag  = stag;
        indx  = (me + nodes - i) % processes;
        shift = (offset[indx][0] / bs) * ndim * bs * bs;
        size  = vsize * l;
        if ( i % 2 == 0 ) {
            mxm ( lda, sizes[indx][0], lda, hsize, a, b, c + shift );
            MPI_Sendrecv ( a, size, MPI_DOUBLE, down, stag,
                           rbuf, size, MPI_DOUBLE, up, rtag, comm, &stats );
        } else {
            mxm ( lda, sizes[indx][0], lda, hsize, rbuf, b, c + shift );
            MPI_Sendrecv ( rbuf, size, MPI_DOUBLE, down, stag,
                           a, size, MPI_DOUBLE, up, rtag, comm, &stats );
        }
    }

    void mxm ( int lda, int m, int l, int n, double *a, double *b, double *c )
    {
        double alpha = 1.0, beta = 1.0;
        char tr = 't';
        dgemm ( &tr, &tr, &m, &n, &l, &alpha, a, &lda, b, &m, &beta, c, &m );
    }

[Diagram: C = A x B, distributed over 8 nodes with 1 process per node; rbuf is the receive buffer]

14 Hybrid MPI+OmpSs
- Typical hybrid parallelization:
  - Parallelize the computation phase
  - Serialize the communication part (no computation/communication overlap)
- How to overlap communication and computation? Using double buffering and asynchronous MPI calls (sketched below), typically with 2 processes per node. Easier with OmpSs...
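A minimal sketch of that manual scheme (buffer and helper names are assumptions): two buffers plus nonblocking MPI so the next exchange overlaps the current computation:

    MPI_Request reqs[2];
    MPI_Irecv(buf[1-cur], size, MPI_DOUBLE, up,   tag, comm, &reqs[0]);
    MPI_Isend(buf[cur],   size, MPI_DOUBLE, down, tag, comm, &reqs[1]);
    compute(buf[cur]);                          /* overlaps the transfers above */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    cur = 1 - cur;                              /* swap buffers for the next step */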

15 Hybrid MPI+OmpSs: Matmul

    orig_rbuf = rbuf = (double (*)[m]) malloc ( m * n * sizeof(double) );
    if ( nodes > 1 ) {
        up   = me < nodes-1 ? me+1 : 0;
        down = me > 0      ? me-1 : nodes-1;
    } else {
        up = down = MPI_PROC_NULL;
    }
    i    = n * me;        // first C block (different for each process)
    size = m * n;
    for ( it = 0; it < nodes; it++ ) {
        #pragma omp task in (a[0:n-1], B[0:m-1]) \
                         inout (C[i:i+n-1][0:n-1]) firstprivate (n,m)
        cblas_dgemm ( CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, m,
                      1.0, (double *) a, m, (double *) B, n,
                      1.0, (double *) &C[i][0], n );

        if ( it < nodes-1 ) {
            #pragma omp task in (a[0:n-1]) out (rbuf[0:n-1]) inout (stats) \
                             firstprivate (size,m,n,tag,down,up)
            MPI_Sendrecv ( a, size, MPI_DOUBLE, down, tag,
                           rbuf, size, MPI_DOUBLE, up, tag, MPI_COMM_WORLD, &stats );
        }
        i = (i + n) % m;                    // next C block (circular)
        ptmp = a; a = rbuf; rbuf = ptmp;    // swap pointers
    }
    #pragma omp taskwait
    free ( orig_rbuf );

[Diagram: C = A x B block distribution and the rbuf receive buffer]

16 Hybrid MPI+OmpSs
- Problems when taskifying communications:
  - Possibly several concurrent MPI calls: a thread-safe MPI library is needed
  - Reordering of MPI calls is a potential source of deadlocks: the order of communication tasks must be controlled, using sentinel dependencies if necessary (see the sketch below)
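A minimal sketch of the sentinel technique (variable names assumed; the slide-15 code uses inout(stats) for the same purpose): an artificial inout dependence chains the communication tasks, so the MPI calls execute in program order even though computation tasks run freely around them:

    char comm_sentinel;   /* dummy variable used only for ordering */
    for ( it = 0; it < nodes; it++ ) {
        #pragma omp task in (a[0:n-1]) out (rbuf[0:n-1]) inout (comm_sentinel)
        MPI_Sendrecv ( a, size, MPI_DOUBLE, down, tag,
                       rbuf, size, MPI_DOUBLE, up, tag, comm, &stats );
        /* ... computation tasks ... */
    }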

17 OpenCL

18 OpenCL in OmpSs
- Two main alternatives, CUDA and OpenCL (very similar); each consists of:
  - An accelerator language (CUDA C / OpenCL C)
  - A host API:
    - Explicit data management (different address spaces): memory allocation and data transfers
    - Kernel management (compilation, execution, synchronization...)
[Diagram: separate host memory and device memory]

19 OpenCL in OmpSs
- Philosophy: C, C++ and Fortran, with OpenCL and CUDA supported
- Based on and compatible with OpenMP 3.0:
  - Write sequential programs and run them in parallel
  - Supports most of the OpenMP 3.0 annotations
- Extends OpenMP with function tasks and parameter annotations
- Provides dynamic parallelism and automatic dependency management

BLAS kernel, OmpSs C version:

    #pragma omp task in([n]x) inout([n]y)
    void saxpy(int n, float a, float* x, float* y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

20 OpenCL in OmpSs: OpenCL example (saxpy.c)

    const char *KernelSrc =
        "kernel void saxpy(int n, float a,   \n"
        "                  global float* x,  \n"
        "                  global float* y)  \n"
        "{                                   \n"
        "    int i = get_global_id(0);       \n"
        "    if (i < n)                      \n"
        "        y[i] = a * x[i] + y[i];     \n"
        "}                                   \n";

    int main(int argc, char** argv)
    {
        float a, h_x[1024], h_y[1024];
        int n = 1024;
        // Init a, h_x and h_y

        cl_uint numPlats;
        clGetPlatformIDs(0, 0, &numPlats);
        cl_platform_id Plat[numPlats];
        clGetPlatformIDs(numPlats, Plat, 0);
        cl_device_id id;
        clGetDeviceIDs(Plat[i], DEV, 1, &id, 0);
        cl_context ctx = clCreateContext(0, 1, &id, 0, 0, 0);
        cl_command_queue cmd = clCreateCommandQueue(ctx, id, 0, 0);
        cl_program program = clCreateProgramWithSource(ctx, 1, &KernelSrc, 0, 0);
        clBuildProgram(program, 0, 0, 0, 0, 0);
        cl_kernel ko_saxpy = clCreateKernel(program, "saxpy", 0);

        cl_mem d_x = clCreateBuffer(ctx, 0, sizeof(float) * n, 0, 0);
        cl_mem d_y = clCreateBuffer(ctx, 0, sizeof(float) * n, 0, 0);
        clEnqueueWriteBuffer(cmd, d_x, CL_TRUE, 0, sizeof(float) * n, h_x, 0, 0, 0);
        clEnqueueWriteBuffer(cmd, d_y, CL_TRUE, 0, sizeof(float) * n, h_y, 0, 0, 0);

        clSetKernelArg(ko_saxpy, 0, 4, &n);
        clSetKernelArg(ko_saxpy, 1, 4, &a);
        size_t sze = sizeof(cl_mem);
        clSetKernelArg(ko_saxpy, 2, sze, &d_x);
        clSetKernelArg(ko_saxpy, 3, sze, &d_y);

        size_t global = 1024, local = 128;
        clEnqueueNDRangeKernel(cmd, ko_saxpy, 1, 0, &global, &local, 0, 0, 0);
        clFinish(cmd);
        clEnqueueReadBuffer(cmd, d_y, CL_TRUE, 0, sizeof(float) * n, h_y, 0, 0, 0);
        printf("%f", h_y[0]);

        clReleaseMemObject(d_x);
        clReleaseMemObject(d_y);
        clReleaseProgram(program);
        clReleaseKernel(ko_saxpy);
        clReleaseCommandQueue(cmd);
        clReleaseContext(ctx);
        return 0;
    }

21 OpenCL in OmpSs: OpenCL example
Seriously??? (the same saxpy.c host code as the previous slide, shown again to stress how verbose it is)

22 OpenCL in OmpSs: example
- Easy integration with CUDA/OpenCL kernels:
  - #pragma omp target device(cuda|opencl) ndrange(dim, size, block_size)
    - Identifies the following function as a CUDA C / OpenCL C kernel
    - Provides the number of dimensions, total size and block size
  - #pragma omp task in(...) out(...)
    - Specifies inputs/outputs as usual
- Automatic management of memory:
  - Device memory allocation
  - Host/device data transfers
- Automatic setup of OpenCL kernels:
  - On-demand kernel compilation
  - Transparent setting of parameters
  - Data-flow synchronization

saxpy.cl:

    kernel void saxpy(int n, float a, global float* x, global float* y)
    {
        int i = get_global_id(0);
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

saxpy.c:

    #pragma omp target device(opencl) \
                       ndrange(1, n, 128) copy_deps
    #pragma omp task in([n]x) inout([n]y)
    kernel void saxpy(int n, float a, global float* x, global float* y);

    #define N 1024

    int main(int argc, char* argv[])
    {
        float a, x[N], y[N];
        saxpy(N, a, x, y);
        #pragma omp taskwait
        print_result(y, N);
        return 0;
    }

23 OpenCL in OmpSs: target directive
- Specific device information is given with the target directive:
  - Always attached to a task directive
  - Mandatory use of the device clause

    #pragma omp target device(type) [ copy_deps ] [ copy_in(...) ] [ copy_out(...) ] [ copy_inout(...) ] \
                       [ ndrange(...) ] [ file(...) ] [ name(...) ]
    #pragma omp task [clause[[,] clause]...]
    structured-block

- Device clause:
  - device(type): the specific device to run the following task on
  - Currently supported devices: smp (the default), opencl, cuda, cluster, fpga, mpi

24 OpenCL in OmpSs: target clauses
- Target directive and copy clauses:

    #pragma omp target device(type) [ copy_in(...) ] [ copy_out(...) ] [ copy_inout(...) ]
    #pragma omp task [clause[[,] clause]...]

  - copy_in(var-list): requests a consistent copy of the variables
  - copy_out(var-list): produces the next version of the variables
  - copy_inout(var-list): combination of copy_in and copy_out
- Target directive and the copy_deps clause:

    #pragma omp target device(type) [ copy_deps ]
    #pragma omp task [clause[[,] clause]...] [ in(...) ] [ out(...) ] [ inout(...) ]

  - copy_deps: the data-copy information is taken from the dependence clauses; in(...) implies copy_in(...), out(...) implies copy_out(...), and inout(...) implies copy_inout(...)

An explicit-copy example is sketched below.
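A hedged sketch of explicit copy clauses (reusing the saxpy kernel from slide 22), for cases where the copied regions are spelled out instead of being derived from the dependences:

    #pragma omp target device(opencl) ndrange(1, n, 128) \
                       copy_in([n]x) copy_inout([n]y)
    #pragma omp task in([n]x) inout([n]y)
    kernel void saxpy(int n, float a, global float* x, global float* y);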

25 OpenCL in OmpSs: dimension clauses
- Target directive and the ndrange clause:

    #pragma omp target device(type) [ ndrange(...) ]
    #pragma omp task [clause[[,] clause]...]

  - ndrange(ndim, G1[, ...Gndim], L1[, ...Lndim]): kernel execution resources
    - ndim: number of dimensions (1, 2 or 3)
    - Gi: global size for dimension i (where 1 <= i <= ndim)
    - Li: local size for dimension i (where 1 <= i <= ndim)
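For example, a hypothetical 2D kernel (transpose and its arguments are assumptions, not from the slides) would request N x M global work-items in 16 x 16 work-groups:

    #pragma omp target device(opencl) ndrange(2, N, M, 16, 16) copy_deps
    #pragma omp task in([N*M]src) out([N*M]dst)
    kernel void transpose(int N, int M, global float* src, global float* dst);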

26 OpenCL in OmpSs: file and name clauses
- Compiling kernels at runtime (e.g. OpenCL)
- Target directive and the file clause:

    #pragma omp target device(type) [ file(...) ]
    #pragma omp task [clause[[,] clause]...]

  - file(file-name): the file containing the kernel code; the source file must also be passed to the compiler
- Target directive and the name clause:

    #pragma omp target device(type) [ name(...) ]
    #pragma omp task [clause[[,] clause]...]

  - name(kernel-name): the kernel name; not required for C/C++. Fortran code is not case sensitive (matmul = MatMul = MATMUL), so the clause forces the Fortran user's call to bind to a specific kernel
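A hypothetical sketch combining both clauses (the file and kernel names are assumptions): file() points at the OpenCL source and name() pins the kernel symbol the task binds to:

    #pragma omp target device(opencl) ndrange(1, n, 128) copy_deps \
                       file(kernels.cl) name(vscale)
    #pragma omp task inout([n]x)
    kernel void vscale(int n, float f, global float* x);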

27 OpenCL in OmpSs: shared-memory clause
- Specifies the amount of shared memory allocated for the kernel
- Target directive (CUDA):

    #pragma omp target device(type) [ shmem(expr) ]
    #pragma omp task [clause[[,] clause]...]

  - expr: an expression giving the number of bytes to reserve
- Target directive (OpenCL):

    #pragma omp target device(type) [ shmem(expr1[, expr-n ...]) ]
    #pragma omp task [clause[[,] clause]...]

  - The argument is a list of expressions, one for each local-memory kernel parameter
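A minimal sketch (the kernel is an assumption): reserving one float per thread of a 128-thread block as CUDA shared memory:

    #pragma omp target device(cuda) ndrange(1, n, 128) shmem(128*sizeof(float)) copy_deps
    #pragma omp task in([n]x) inout([n]y)
    __global__ void blur(int n, float* x, float* y);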

28 OpenCL in OmpSs: MPI
- OmpSs integrates CUDA and OpenCL seamlessly with MPI:

    #pragma omp target device(cuda) copy_deps ndrange(1, last-first, 128)
    #pragma omp task in(particle_array[0;particles]) out([part_per_process]particle_array2)
    void Particle_array_calculate_forces_cuda(Particle* particle_array, Particle* particle_array2,
                                              int particles, int part_per_process,
                                              float time_interval, int first, int last);

    Particle_broadcast_arguments();
    Particle_array_initialize(particle_array, particles);
    for (int i = 1; i <= timesteps; i++) {
        Particle_array_calculate_forces_cuda(particle_array, particle_array2, particles,
                                             part_per_process, time_interval, first, last);

        #pragma omp target device(smp) copy_deps
        #pragma omp task out(particle_array[0;particles]) in(particle_array2[first:last])
        MPI_Allgather(particle_array, part_per_process*8, MPI_FLOAT,
                      particle_array2, part_per_process*8, MPI_FLOAT, MPI_COMM_WORLD);
    } /* for each timestep */
    #pragma omp taskwait

29 OpenCL in OmpSs: Fortran support
- Fortran code can call CUDA or OpenCL kernels like C functions.

krist.cl:

    kernel void cstructfac(int na, int nr, int nc, float f2,
                           int NA, global float4* a,
                           int NH, global float4* h,
                           int NE, global float4* E,
                           local float4* ashared)
    {
        int i = get_global_id(0);
        if (i < na) { ... }
    }

main.f90 (the directive carries the dimension, the global/local sizes and the shared-memory size; IN/OUT give the inputs and outputs):

    INTERFACE
       !$OMP TARGET DEVICE(OPENCL) COPY_DEPS NDRANGE(1, NR, 128) SHMEM(...) FILE(krist.cl)
       !$OMP TASK IN(A, H) OUT(E)
       SUBROUTINE CSTRUCTFAC(NA, NR, NC, F2, DIM_NA, A, DIM_NH, H, DIM_NE, E)
          INTEGER :: NA, NR, NC, DIM_NA, DIM_NH, DIM_NE
          REAL :: F2, A(DIM_NA), H(DIM_NH), E(DIM_NE)
       END SUBROUTINE CSTRUCTFAC
    END INTERFACE
    [...]
    ! A normal Fortran subroutine call launches the kernel
    DO II = 1, NR, NR_2
       IND_H = (DIM2_H * (II - 1)) + 1
       IND_E = (DIM2_E * (II - 1)) + 1
       CALL CSTRUCTFAC(NA, NR_2, MAXATOMS, F2, DIM_NA, A, &
                       DIM_NH/TASKS, H(IND_H : IND_H + (DIM_NH/TASKS) - 1), &
                       DIM_NE/TASKS, E(IND_E : IND_E + (DIM_NE/TASKS) - 1))
    END DO

30 OpenCL in OmpSs: performance and scalability
- Performance/programmability trade-off:
  - Performance competitive with hand-written code
  - Fewer lines of code and API calls
- Matmul and Krist: fewer lines of code and API calls
- Matmul: better performance

31 OpenCL in OmpSs: performance and scalability
- Scalability results competitive with hand-written code
- Transparently uses all available devices, provided there is enough parallelism

32 Conclusions

33 Conclusions
- OmpSs is a programming model that enables:
  - Incremental parallelization of sequential code
  - A data-flow (asynchronous) execution model
  - Nice support for heterogeneous environments
- Many optimizations under the hood:
  - Advanced scheduling policies
  - Work stealing / load balancing
  - Data prefetching
- OmpSs is open source: have a look at ...

34 Thank you! For further information please contact ...
