Hybrid OmpSs OmpSs, MPI and GPUs Guillermo Miranda
|
|
- Fay Powers
- 5 years ago
- Views:
Transcription
1 Hybrid OmpSs OmpSs, MPI and GPUs Guillermo Miranda Trondheim, 3 October 2013
2 Outline OmpSs Programming Model StarSs Comparison with OpenMP Heterogeneity (OpenCL and CUDA support) Hybrid MPI+OmpSs Why? Advantages Communication overlapping OpenCL Integration in OmpSs Directives Fortran support Performance and scalability Contact: Source code available from 2
3 OmpSs Programming Model StarSs StarSs: a family of task based programming models. SMPSs CellSs GPUSs GridSs Basic concept: write sequential on a flat single address space + directionality annotations Data dependency synchronisation Asynchronous calls OmpSs combines StarSs features into a single programming model 3
4 OmpSs Programming Model Comparison with OpenMP 3.0 (I) Dataflow is specified using data dependencies No need for parallel regions Explicit synchronization point if required (taskwait) Different task scheduling policies: Global or local task queues (wf, bf) Data locality (NUMA, GPU, cluster) Multiple implementations of the same task (versioning) 4
5 OmpSs Programming Model Comparison with OpenMP 3.0 (II) CUDA and OpenCL support Copy clauses allow to work with multiple address spaces The programmer has the perception of a single address space Runtime library will keep consistency in between memories The compiler tool-chain allows to work with heterogeneus code Multiple device architectures Different implementations of the same task (in the same or different devices) 5
6 OmpSs Programming Model Comparison with OpenMP 4.0 Dependencies Similar syntax OpenMP does not support strided memory Target directive The OpenMP target has a different meaning In OpenMP, the device compiler will be called to generate code 6
7 OmpSs Programming Model OmpSs execution model Dataflow programming model that works well with: Large and complex legacy applications Incremental parallelization methodology Programmer provides local information, runtime manages global application New or legacy applications that requires acceleration Nice support for OpenCL/CUDA kernels View of a single memory address space 7
8 OmpSs Programming Model void foo ( int *a, int *b ) { for ( i = 1; i < N; i++ ) { #pragma omp task in(a[i-1]) inout(a[i]) out(b[i]) propagate(&a[i-1],&a[i],&b[i]); #pragma omp task in(b[i-1]) inout(b[i]) correct(&b[i-1],&b[i]); } #pragma omp taskwait } 8
9 OmpSs Programming Model Tool-chain Mercurium: Source-to-source compiler Supports Fortran, C and C++ Nanos++: Common execution runtime (C, C++ and Fortran) Task creation, dependency management, task scheduling... 9
10 Hybrid MPI+OmpSs
11 Hybrid MPI+OmpSs Why? MPI is here to stay A lot of HPC applications already written in MPI MPI scales well to tens/hundreds of thousands of nodes MPI exploits inter-node parallelism while OmpSs exploits node parallelism Overlap of communication and computation through the inclusion of communication in the task graph MPI/OmpSs overcomes the too synchronous structure of MPI/OpenMP, propagating the dataflow asynchronous behavior to the MPI level 11
12 Hybrid MPI+OmpSs Execution time (lower = better) Performance Higher at smaller problem sizes Improved Load balance (less processes) Higher IPC Overlap communication/computation Tolerance to bandwidth and OS noise Performance (higher = better) Execution time (lower = better) 12
13 Hybrid MPI+OmpSs: Matrix multiplication (MPI) for( i = 0; i < processes; i++ ) { stag = i + 1; rtag = stag; indx = (me + nodes - i )%processes; shift = ((offset[indx][0]/bs)*ndim*bs*bs); size = vsize*l; if(i%2 == 0) { mxm ( lda, sizes[indx][0], lda, hsize, a, b, (c+shift) ); MPI_Sendrecv (a, size, MPI_DOUBLE, down, stag, rbuf, size, MPI_DOUBLE, up, rtag, comm, &stats ); } else { mxm (lda, sizes[indx][0], lda, hsize, rbuf, b, (c+shift) ); MPI_Sendrecv (rbuf, size, MPI_DOUBLE, down, stag, a, size, MPI_DOUBLE, up, rtag, comm, &stats ); } } C A B 8 nodes, 1 proc per node rbuf void mxm ( int lda, int m, int l, int n, double *a, double *b, double *c ) { double alpha=1.0, beta=1.0; int i, j; char tr = 't'; dgemm(&tr, &tr, &m, &n, &l, &alpha, a, &lda, b, &m, &beta, c, &m); } 13
14 Hybrid MPI+OmpSs Typical hybrid parallelization: Parallelize computation phase Serialize communication part How to overlap communication and computation? Using double buffering and Asynchronous MPI calls 2 processes per node. Easier with OmpSs... No computation/communication overlap 14
15 Hybrid MPI+OmpSs: Matmul orig_rbuf = rbuf = (double (*)[m])malloc(m*n*sizeof(double)); if (nodes >1) { up = me<nodes-1? me+1:0; down = me>0? me-1:nodes-1; } else { up = down = MPI_PROC_NULL; } a=a; i = n*me; // first C block (different for each process) size = m*n; for( it = 0; it < nodes; it++ ) { #pragma omp task in (a[0:n-1], B[0:m-1])\ inout (C[i:i+n-1][0:n-1]) firstprivate (n,m) cblas_dgemm(cblasrowmajor, CblasNoTrans, CblasNoTrans, n, n, m, 1.0, (double *)a, m, (double *)B, n, 1.0, (double *)&C[i][0], n); if (it < nodes-1) { #pragma omp task in (a[0:n-1]) out (rbuf[0:n-1]) inout(stats) \ firstprivate (size,m,n,tag,down,up) MPI_Sendrecv( a, size, MPI_DOUBLE, down, tag, rbuf, size, MPI_DOUBLE, up, tag, MPI_COMM_WORLD, &stats ); } i = (i+n)%m; ptmp=a; a=rbuf; rbuf=ptmp; //next C block circular //swap pointers C A B } #pragma omp taskwait free (orig_rbuf); rbuf 15
16 Hybrid MPI+OmpSs Problems taskifying communications: Possibly several concurrent MPI calls Need to use thread-safe MPI Reordering of MPI calls potential source of deadlocks Need to control order of communication tasks Use of sentinels if necessary 16
17 OpenCL
18 OpenCL in OmpSs Two main alternatives CUDA/OpenCL (very similar) Accelerator language (CUDA C/OpenCL C) CUDA/OpenCL Host API Explicit data management (different address spaces) Memory allocation Data transfers Kernel management (compilation, execution, synchronization ) Host memory Device memory 18
19 OpenCL in OmpSs Philosophy: C, C++ and Fortran with OpenCL and CUDA supported Based/compatible with OpenMP 3.0 Write sequential programs an run them in parallel Support most of the OpenMP 3.0 annotations Extend OpenMP with function-tasks and parameter annotations Provide dynamic parallelism and automatic dependency management BLAS kernel, OmpSs C version #pragma omp task in([n]x) inout([n]y) void saxpy(int n, float a, float* x, float* y) { for (i=0; i < size; i++) y[i] = a * x[i] + y[i]; } 19
20 OpenCL in OmpSs: OpenCL example cl_mem d_x = clcreatebuffer(ctx, 0, sizeof(float) * n, 0, 0); cl_mem d_y = clcreatebuffer(ctx, 0, sizeof(float) * n, 0, 0); const char *KernelSrc = "\n" \ " kernel void saxpy(int n, float a, \n" \ " global float* x, \n" \ " global float* y) \n" \ "{ \n" \ " int i = get_global_id(0); \n" \ " if(i < n) \n" \ " y[i] = a * x[i] + y[i]; \n" \ "} clenqueuewritebuffer(cmd, d_x, CL_TRUE, 0, sizeof(float) * n, h_x, 0, 0, 0); clenqueuewritebuffer(cmd, d_y, CL_TRUE, 0, sizeof(float) * n, h_y, 0, 0, 0); clsetkernelarg(ko_saxpy, 0, 4, &n); clsetkernelarg(ko_saxpy, 1, 4, &a); size_t sze = sizeof(cl_mem); clsetkernelarg(ko_saxpy, 2, sze, &d_x); clsetkernelarg(ko_saxpy, 3, sze, &d_y); int main(int argc, char** argv) { float a, h_x[1024], h_y[1024]; // Init a, h_x and h_y; cl_uint numplats; clgetplatformids(0, 0, &numplats); cl_platform_id Plat[numPlats]; clgetplatformids(numplats, Plat, 0); clgetdeviceids(plat[i], DEV, 1, &id,0); cl_context ctx = clcreatecontext(0, 1, &id, 0, 0, 0); cl_command_queue cmd = clcreatecommandqueue(ctx, id, 0, 0); cl_program program = clcreateprogramwithsource(ctx, 1, KernelSrc, 0, 0); clbuildprogram(program, 0, 0, 0, 0, 0); cl_kernel ko_saxpy = clcreatekernel( program, "saxpy", 0); saxpy.c size_t global = 1024, local = 128; clenqueuendrangekernel(cmd, ko_saxpy, 1, 0, &global, &local, 0, 0, 0); clfinish(commands); clenqueuereadbuffer( commands, d_y, CL_TRUE, 0, 4 * n, h_y, 0, 0, 0 ); printf("%f, h_y[0]"); clreleasememobject(d_x); clreleasememobject(d_y); clreleaseprogram(program); clreleasekernel(ko_saxpy); clreleasecommandqueue(commands); clreleasecontext(ctx); return 0; saxpy.c } 20
21 OpenCL in OmpSs: OpenCL example Se rio us ly??? cl_mem d_x = clcreatebuffer(ctx, 0, sizeof(float) * n, 0, 0); cl_mem d_y = clcreatebuffer(ctx, 0, sizeof(float) * n, 0, 0); const char *KernelSrc = "\n" \ " kernel void saxpy(int n, float a, \n" \ " global float* x, \n" \ " global float* y) \n" \ "{ \n" \ " int i = get_global_id(0); \n" \ " if(i < n) \n" \ " y[i] = a * x[i] + y[i]; \n" \ "} clenqueuewritebuffer(cmd, d_x, CL_TRUE, 0, sizeof(float) * n, h_x, 0, 0, 0); clenqueuewritebuffer(cmd, d_y, CL_TRUE, 0, sizeof(float) * n, h_y, 0, 0, 0); clsetkernelarg(ko_saxpy, 0, 4, &n); clsetkernelarg(ko_saxpy, 1, 4, &a); size_t sze = sizeof(cl_mem); clsetkernelarg(ko_saxpy, 2, sze, &d_x); clsetkernelarg(ko_saxpy, 3, sze, &d_y); int main(int argc, char** argv) { float a, h_x[1024], h_y[1024]; // Init a, h_x and h_y; cl_uint numplats; clgetplatformids(0, 0, &numplats); cl_platform_id Plat[numPlats]; clgetplatformids(numplats, Plat, 0); clgetdeviceids(plat[i], DEV, 1, &id,0); cl_context ctx = clcreatecontext(0, 1, &id, 0, 0, 0); cl_command_queue cmd = clcreatecommandqueue(ctx, id, 0, 0); cl_program program = clcreateprogramwithsource(ctx, 1, KernelSrc, 0, 0); clbuildprogram(program, 0, 0, 0, 0, 0); cl_kernel ko_saxpy = clcreatekernel( program, "saxpy", 0); saxpy.c size_t global = 1024, local = 128; clenqueuendrangekernel(cmd, ko_saxpy, 1, 0, &global, &local, 0, 0, 0); clfinish(commands); clenqueuereadbuffer( commands, d_y, CL_TRUE, 0, 4 * n, h_y, 0, 0, 0 ); printf("%f, h_y[0]"); clreleasememobject(d_x); clreleasememobject(d_y); clreleaseprogram(program); clreleasekernel(ko_saxpy); clreleasecommandqueue(commands); clreleasecontext(ctx); return 0; saxpy.c } 21
22 OpenCL in OmpSs: example Easy integration with CUDA/OpenCL kernels #pragma omp target device(cuda opencl) ndrange(dim, size, block_size) Identifies the following function as CUDA C/OpenCL C kernel Provides the number of dimensions, total size and block size #pragma omp task in( ) out( ) Specifies input/output as usual Automatic management of memory Device memory allocation Host/device data transfer Automatic setup of OpenCL kernels On-demand kernel compilation Transparent setting of parameters Data-flow synchronization kernel void saxpy(int n, float a, global float* x, global float* y) { int i = get_global_id(0); if(i < n) y[i] = a * x[i] + y[i]; } saxpy.cl #pragma omp target device(opencl) \ ndrange(1, n, 128) copy_deps #pragma omp task in([n]x) inout([n]y) kernel void saxpy(int n, float a, global float* x, global float* y); #define N 1024 int main(int argc, char* argv[]) { float a, x[n], y[n]; saxpy(n, a, x, y); #pragma omp taskwait print_result(y, N); return 0; } saxpy.c 22
23 OpenCL in OmpSs: target directive Specific device information: the target directive Always attached to the task directive Mandatory use of device clause #pragma omp omp target target device device (type) (type) #pragma [copy_deps] [[ copy_in(...) copy_in(...) ]] [[ copy_out(...) copy_out(...) ]] [[ copy_inout copy_inout (...) (...) ]] }} {{ [copy_deps] ndrange( ) ]] [[ file( ) file( ) ]] [[ name( ) name( ) ]] [[ ndrange( ) #pragma omp omp task task [clause[[,] [clause[[,] clause]...] clause]...] #pragma structured-block }} {{ structured-block Device clause device(type): specific device to run the following task Current supported devices: smp, opencl, cuda, cluster, fpga, mpi Default 23
24 OpenCL in OmpSs: target clauses Target Directive (and copy clauses) #pragma omp omp target target device(type) device(type) [[ copy_in( ) copy_in( ) ]] [[ copy_out( ) copy_out( ) ]] [[ copy_inout( ) copy_inout( ) ]] #pragma #pragma omp omp task task [clause[[,] [clause[[,] clause]...] clause]...] #pragma copy_in(var-list): requesting a consistent copy of variables copy_out(var-list): producing next version of variable copy_inout(var-list): combination of in and out Target Directive (and copy_deps clause) #pragma omp omp target target device(type) device(type) [[ copy_deps copy_deps ]] #pragma #pragma omp omp task task [clause[[,] [clause[[,] clause]...] clause]...] [[ in( ) in( ) ]] [[ out( ) out( ) ]] [[ inout( ) inout( ) ]] #pragma copy_deps: data copy information is in the dependence clauses In( ) copy_in( ) out copy_out inout copy_inout( ) 24
25 OpenCL in OmpSs: dimension clauses Target Directive (and dimension clauses) #pragma omp omp target target device(type) device(type) [[ ndrange( ) ndrange( ) ]] #pragma #pragma omp omp task task [clause[[,] [clause[[,] clause]...] clause]...] #pragma ndrange(ndim,g1[, Gndim],L1[,...Lndim]): kernel execution resources ndim 1, 2 or 3 number of dimensions Gi global size for dimension i (where 1 i ndim) Li local size for dimension i (where 1 i ndim) 25
26 OpenCL in OmpSs: file and name clauses Compiling kernels at runtime (e.g. OpenCL) Target Directive (and file clause) #pragma omp omp target target device(type) device(type) [[ file( ) file( ) ]] #pragma #pragma omp omp task task [clause[[,] [clause[[,] clause]...] clause]...] #pragma file(file-name): file containing kernel code Target Directive (and name clause) The source file must be passed to the compiler #pragma omp omp target target device(type) device(type) [[ name( ) name( ) ]] #pragma #pragma omp omp task task [clause[[,] [clause[[,] clause]...] clause]...] #pragma name(kernel-name): kernel name not required for C/C++ Fortran code is not case sensitive (matmul = MatMul = MATMUL) Forcing fortran s user to call an specific kernel 26
27 OpenCL in OmpSs: shared memory clause Specifies the amount of shared memory allocated for the kernel Target Directive (CUDA) #pragma omp omp target target device(type) device(type) [[ shmem( shmem( expr) expr) ]] #pragma #pragma omp omp task task [clause[[,] [clause[[,] clause]...] clause]...] #pragma expr: an expression with the number of bytes to reserve Target Directive (OpenCL) #pragma omp omp target target device(type) device(type) [[ shmem( shmem( expr1[, expr1[, expr-n expr-n...]...] )) ]] #pragma #pragma omp omp task task [clause[[,] [clause[[,] clause]...] clause]...] #pragma Argument is a list of expressions for each parameter 27
28 OpenCL in OmpsS: MPI OmpSs perfectly integrates CUDA and OpenCL with MPI #pragma omp target device(cuda) copy_deps ndrange(1, last-first, 128 ) #pragma omp task in(particle_array[0;particles]) out([part_per_process]particle_array2) Particle_array_calculate_forces_cuda(Particle* particle_array, Particle* particle_array2, int particles, Int part_per_process, float time_interval, int first, int last); Particle_broadcast_arguments(); Particle_array_initialize(particle_array, particles); for (int i = 1; i <= timesteps; i++) { Particle_array_calculate_forces_cuda(particle_array, particle_array2, particles, part_per_process, time_interval, first, last); #pragma omp target device(smp) copy_deps #pragma omp task out(particle_array[0;particles]) in(particle_array2[first:last]) MPI_Allgather(particle_array, part_per_process*8, MPI_FLOAT, particle_array2, part_per_process*8, MPI_FLOAT, MPI_COMM_WORLD); } /* for timestep */ #pragma omp taskwait 28
29 OpenCL in OmpsS: Fortran support Fortran code can call CUDA or OpenCL kernels like C functions kernel void cstructfac(int na, int nr, int nc, float f2, int NA, global float4* a, int NH, global float4* h,int NE, global float4* E, local float4* ashared){ int i = get_global_id(0); if (i<na) {..}; } Global size/s krist.cl Local size/s INTERFACE!$OMP TARGET DEVICE(OPENCL) COPY_DEPS NDRANGE(1, NR, 128) SHMEM( ) FILE(krist.cl)!$OMP TASK IN (A, H) OUT(E) SUBROUTINE CSTRUCTFAC(NA, NR, NC, F2, DIM_NA, A, DIM_NH, H, DIM_NE, E) main.f90 INTEGER :: NA, NR, NC, DIM_NA, DIM_NH, DIM_NE REAL :: F2, A(DIM_NA), H(DIM_NH), E(DIM_NE) Shared memory END SUBROUTINE CSTRUCTFAC Dim:1, 2 or 3 size/s END INTERFACE Input/Outputs [...] DO II = 1, NR, NR_2 IND_H = (DIM2_H * (II - 1)) + 1 IND_E = (DIM2_E * (II - 1)) + 1 CALL CSTRUCTFAC(NA, NR_2, MAXATOMS, F2, DIM_NA, A, & DIM_NH/TASKS, H(IND_H : IND_H + (DIM_NH/TASKS) - 1),& DIM_NE/TASKS, E(IND_E : IND_E + (DIM_NE / TASKS) - 1)) END DO [...] Normal Fortran subroutine call 29
30 OpenCL in OmpsS: performance and scalability Performance/programmability trade-off: Performance competitive with hand-written code. Lower number of code lines and API calls. Matmul and Krist: lower LoC and API calls Matmul: better performance 30
31 OpenCL in OmpsS: performance and scalability Scalability results competitive with hand-written code Transparently uses the number of available devices If there is enough parallelism 31
32 Conclusions
33 Conclusions OmpSs is a programming model that enables Incremental parallelization of sequential code Data-flow execution model (asynchronous) Nicely supports heterogeneous environments Many optimizations under the hood: Advanced scheduling policies Work stealing/load balancing Data prefetching OmpSs is open source Have a look at 33
34 Thank you! For further information please contact 34
Tutorial OmpSs: Overlapping communication and computation
www.bsc.es Tutorial OmpSs: Overlapping communication and computation PATC course Parallel Programming Workshop Rosa M Badia, Xavier Martorell PATC 2013, 18 October 2013 Tutorial OmpSs Agenda 10:00 11:00
More informationProgramming with
Programming with OmpSs@CUDA/OpenCL Xavier Martorell, Rosa M. Badia Computer Sciences Research Dept. BSC PATC Parallel Programming Workshop July 1-2, 2015 Agenda Motivation Leveraging OpenCL and CUDA Examples
More informationOmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel
www.bsc.es OmpSs + OpenACC Multi-target Task-Based Programming Model Exploiting OpenACC GPU Kernel Guray Ozen guray.ozen@bsc.es Exascale in BSC Marenostrum 4 (13.7 Petaflops ) General purpose cluster (3400
More informationHybridizing MPI and tasking: The MPI+OmpSs experience. Jesús Labarta BSC CS Dept. Director
Hybridizing MPI and tasking: The MPI+OmpSs experience Jesús Labarta BSC CS Dept. Director Russian Supercomputing Days Moscow, September 25th, 2017 MPI + X Why hybrid? MPI is here to stay A lot of HPC applications
More informationOmpSs Fundamentals. ISC 2017: OpenSuCo. Xavier Teruel
OmpSs Fundamentals ISC 2017: OpenSuCo Xavier Teruel Outline OmpSs brief introduction OmpSs overview and influence in OpenMP Execution model and parallelization approaches Memory model and target copies
More informationOmpSs Specification. BSC Programming Models
OmpSs Specification BSC Programming Models March 30, 2017 CONTENTS 1 Introduction to OmpSs 3 1.1 Reference implementation........................................ 3 1.2 A bit of history..............................................
More informationOpenMPSuperscalar: Task-Parallel Simulation and Visualization of Crowds with Several CPUs and GPUs
www.bsc.es OpenMPSuperscalar: Task-Parallel Simulation and Visualization of Crowds with Several CPUs and GPUs Hugo Pérez UPC-BSC Benjamin Hernandez Oak Ridge National Lab Isaac Rudomin BSC March 2015 OUTLINE
More informationDesign and Development of support for GPU Unified Memory in OMPSS
Design and Development of support for GPU Unified Memory in OMPSS Master in Innovation and Research in Informatics (MIRI) High Performance Computing (HPC) Facultat d Informàtica de Barcelona (FIB) Universitat
More informationHeterogeneous Computing
OpenCL Hwansoo Han Heterogeneous Computing Multiple, but heterogeneous multicores Use all available computing resources in system [AMD APU (Fusion)] Single core CPU, multicore CPU GPUs, DSPs Parallel programming
More informationECE 574 Cluster Computing Lecture 17
ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 6 April 2017 HW#8 will be posted Announcements HW#7 Power outage Pi Cluster Runaway jobs (tried
More informationFundamentals of OmpSs
www.bsc.es Fundamentals of OmpSs Tasks and Dependences Xavier Teruel New York, June 2013 AGENDA: Fundamentals of OmpSs Tasking and Synchronization Data Sharing Attributes Dependence Model Other Tasking
More informationParallel Programming Libraries and implementations
Parallel Programming Libraries and implementations Partners Funding Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License.
More informationAn Extension of the StarSs Programming Model for Platforms with Multiple GPUs
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento
More informationParallel Programming. Libraries and Implementations
Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationMultiGPU Made Easy by OmpSs + CUDA/OpenACC
www.bsc.es MultiGPU Made Easy by OmpSs + CUD/OpenCC ntonio J. Peña Sr. Researcher & ctivity Lead Manager, SC/UPC NVIDI GCoE San Jose 2018 Introduction: Programming Models for GPU Computing CUD (Compute
More informationMIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise. June 2011
MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise June 2011 FREE LUNCH IS OVER, CODES HAVE TO MIGRATE! Many existing legacy codes needs to migrate to
More informationCS/EE 217 GPU Architecture and Parallel Programming. Lecture 22: Introduction to OpenCL
CS/EE 217 GPU Architecture and Parallel Programming Lecture 22: Introduction to OpenCL Objective To Understand the OpenCL programming model basic concepts and data types OpenCL application programming
More informationBest GPU Code Practices Combining OpenACC, CUDA, and OmpSs
www.bsc.es Best GPU Code Practices Combining OpenACC, CUDA, and OmpSs Pau Farré Antonio J. Peña Munich, Oct. 12 2017 PROLOGUE Barcelona Supercomputing Center Marenostrum 4 13.7 PetaFlop/s General Purpose
More informationOmpSs Examples and Exercises
OmpSs Examples and Exercises Release BSC Programming Models Mar 22, 2018 CONTENTS 1 Introduction 3 1.1 System configuration........................................... 3 1.2 Building the examples..........................................
More informationOverview of research activities Toward portability of performance
Overview of research activities Toward portability of performance Do dynamically what can t be done statically Understand evolution of architectures Enable new programming models Put intelligence into
More informationIncremental Migration of C and Fortran Applications to GPGPU using HMPP HPC Advisory Council China Conference 2010
Innovative software for manycore paradigms Incremental Migration of C and Fortran Applications to GPGPU using HMPP HPC Advisory Council China Conference 2010 Introduction Many applications can benefit
More informationMasterpraktikum Scientific Computing
Masterpraktikum Scientific Computing High-Performance Computing Michael Bader Alexander Heinecke Technische Universität München, Germany Outline Intel Cilk Plus OpenCL Übung, October 7, 2012 2 Intel Cilk
More informationPATC Training. OmpSs and GPUs support. Xavier Martorell Programming Models / Computer Science Dept. BSC
PATC Training OmpSs and GPUs support Xavier Martorell Programming Models / Computer Science Dept. BSC May 23 rd -24 th, 2012 Outline Motivation OmpSs Examples BlackScholes Perlin noise Julia Set Hands-on
More informationAdvanced OpenMP. Other threading APIs
Advanced OpenMP Other threading APIs What s wrong with OpenMP? OpenMP is designed for programs where you want a fixed number of threads, and you always want the threads to be consuming CPU cycles. - cannot
More informationA brief introduction to OpenMP
A brief introduction to OpenMP Alejandro Duran Barcelona Supercomputing Center Outline 1 Introduction 2 Writing OpenMP programs 3 Data-sharing attributes 4 Synchronization 5 Worksharings 6 Task parallelism
More informationALYA Multi-Physics System on GPUs: Offloading Large-Scale Computational Mechanics Problems
www.bsc.es ALYA Multi-Physics System on GPUs: Offloading Large-Scale Computational Mechanics Problems Vishal Mehta Engineer, Barcelona Supercomputing Center vishal.mehta@bsc.es Training BSC/UPC GPU Centre
More informationParallel Hybrid Computing Stéphane Bihan, CAPS
Parallel Hybrid Computing Stéphane Bihan, CAPS Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism Various heterogeneous hardware
More informationExtending the Task-Aware MPI (TAMPI) Library to Support Asynchronous MPI primitives
Extending the Task-Aware MPI (TAMPI) Library to Support Asynchronous MPI primitives Kevin Sala, X. Teruel, J. M. Perez, V. Beltran, J. Labarta 24/09/2018 OpenMPCon 2018, Barcelona Overview TAMPI Library
More informationParallel Programming Overview
Parallel Programming Overview Introduction to High Performance Computing 2019 Dr Christian Terboven 1 Agenda n Our Support Offerings n Programming concepts and models for Cluster Node Core Accelerator
More informationTask-based programming models to support hierarchical algorithms
www.bsc.es Task-based programming models to support hierarchical algorithms Rosa M BadiaBarcelona Supercomputing Center SHAXC 2016, KAUST, 11 May 2016 Outline BSC Overview of superscalar programming model
More informationA Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures
A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures E. Ayguade 1,2, R.M. Badia 2,4, D. Cabrera 2, A. Duran 2, M. Gonzalez 1,2, F. Igual 3, D. Jimenez 1, J. Labarta 1,2, X. Martorell
More informationParallel Hybrid Computing F. Bodin, CAPS Entreprise
Parallel Hybrid Computing F. Bodin, CAPS Entreprise Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism Various heterogeneous
More informationOpenCL. Matt Sellitto Dana Schaa Northeastern University NUCAR
OpenCL Matt Sellitto Dana Schaa Northeastern University NUCAR OpenCL Architecture Parallel computing for heterogenous devices CPUs, GPUs, other processors (Cell, DSPs, etc) Portable accelerated code Defined
More informationParallel Programming. Libraries and implementations
Parallel Programming Libraries and implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationAMath 483/583, Lecture 24, May 20, Notes: Notes: What s a GPU? Notes: Some GPU application areas
AMath 483/583 Lecture 24 May 20, 2011 Today: The Graphical Processing Unit (GPU) GPU Programming Today s lecture developed and presented by Grady Lemoine References: Andreas Kloeckner s High Performance
More informationOpenMP and MPI. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico.
OpenMP and MPI Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico November 16, 2011 CPD (DEI / IST) Parallel and Distributed Computing 18
More informationHybrid/Heterogeneous Programming with StarSs. Jesus Labarta Director Computer Sciences Research Dept. BSC
Hybrid/Heterogeneous Programming with StarSs Jesus Labarta Director Computer Sciences Research Dept. BSC Index of the talk Introduction StarSs StarSs @ heterogenous nodes Hybrid MPI/StarSs Conclusion Jesus
More informationImproving the interoperability between MPI and OmpSs-2
Improving the interoperability between MPI and OmpSs-2 Vicenç Beltran Querol vbeltran@bsc.es 19/04/2018 INTERTWinE Exascale Application Workshop, Edinburgh Why hybrid MPI+OmpSs-2 programming? Gauss-Seidel
More informationSession 4: Parallel Programming with OpenMP
Session 4: Parallel Programming with OpenMP Xavier Martorell Barcelona Supercomputing Center Agenda Agenda 10:00-11:00 OpenMP fundamentals, parallel regions 11:00-11:30 Worksharing constructs 11:30-12:00
More informationOpenCL TM & OpenMP Offload on Sitara TM AM57x Processors
OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors 1 Agenda OpenCL Overview of Platform, Execution and Memory models Mapping these models to AM57x Overview of OpenMP Offload Model Compare and contrast
More informationHybrid MPI + OmpSs Advanced concept of OmpSs
www.bsc.es Hybrid MPI + OmpSs Advanced concept of OmpSs Rosa Badia, Xavier Martorell Session 2: Hybrid MPI + OmpSs and Load Balance! Hybrid programming Issues and basic examples! Load Balance! OmpSs advanced
More informationOpenMP - II. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS15/16. HPAC, RWTH Aachen
OpenMP - II Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS15/16 OpenMP References Using OpenMP: Portable Shared Memory Parallel Programming. The MIT
More informationExploring Dynamic Parallelism on OpenMP
www.bsc.es Exploring Dynamic Parallelism on OpenMP Guray Ozen, Eduard Ayguadé, Jesús Labarta WACCPD @ SC 15 Guray Ozen - Exploring Dynamic Parallelism in OpenMP Austin, Texas 2015 MACC: MACC: Introduction
More informationOpenACC Standard. Credits 19/07/ OpenACC, Directives for Accelerators, Nvidia Slideware
OpenACC Standard Directives for Accelerators Credits http://www.openacc.org/ o V1.0: November 2011 Specification OpenACC, Directives for Accelerators, Nvidia Slideware CAPS OpenACC Compiler, HMPP Workbench
More informationProgramming paradigms for GPU devices
Programming paradigms for GPU devices OpenAcc Introduction Sergio Orlandini s.orlandini@cineca.it 1 OpenACC introduction express parallelism optimize data movements practical examples 2 3 Ways to Accelerate
More informationGPGPU IGAD 2014/2015. Lecture 1. Jacco Bikker
GPGPU IGAD 2014/2015 Lecture 1 Jacco Bikker Today: Course introduction GPGPU background Getting started Assignment Introduction GPU History History 3DO-FZ1 console 1991 History NVidia NV-1 (Diamond Edge
More informationLittle Motivation Outline Introduction OpenMP Architecture Working with OpenMP Future of OpenMP End. OpenMP. Amasis Brauch German University in Cairo
OpenMP Amasis Brauch German University in Cairo May 4, 2010 Simple Algorithm 1 void i n c r e m e n t e r ( short a r r a y ) 2 { 3 long i ; 4 5 for ( i = 0 ; i < 1000000; i ++) 6 { 7 a r r a y [ i ]++;
More informationOpenACC 2.6 Proposed Features
OpenACC 2.6 Proposed Features OpenACC.org June, 2017 1 Introduction This document summarizes features and changes being proposed for the next version of the OpenACC Application Programming Interface, tentatively
More informationModule 10: Open Multi-Processing Lecture 19: What is Parallelization? The Lecture Contains: What is Parallelization? Perfectly Load-Balanced Program
The Lecture Contains: What is Parallelization? Perfectly Load-Balanced Program Amdahl's Law About Data What is Data Race? Overview to OpenMP Components of OpenMP OpenMP Programming Model OpenMP Directives
More informationOpenMP tasking model for Ada: safety and correctness
www.bsc.es www.cister.isep.ipp.pt OpenMP tasking model for Ada: safety and correctness Sara Royuela, Xavier Martorell, Eduardo Quiñones and Luis Miguel Pinho Vienna (Austria) June 12-16, 2017 Parallel
More informationOpenMP and MPI. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico.
OpenMP and MPI Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico November 15, 2010 José Monteiro (DEI / IST) Parallel and Distributed Computing
More informationChip Multiprocessors COMP Lecture 9 - OpenMP & MPI
Chip Multiprocessors COMP35112 Lecture 9 - OpenMP & MPI Graham Riley 14 February 2018 1 Today s Lecture Dividing work to be done in parallel between threads in Java (as you are doing in the labs) is rather
More informationADVANCED ACCELERATED COMPUTING USING COMPILER DIRECTIVES. Jeff Larkin, NVIDIA
ADVANCED ACCELERATED COMPUTING USING COMPILER DIRECTIVES Jeff Larkin, NVIDIA OUTLINE Compiler Directives Review Asynchronous Execution OpenACC Interoperability OpenACC `routine` Advanced Data Directives
More informationOpenMP 4.0/4.5. Mark Bull, EPCC
OpenMP 4.0/4.5 Mark Bull, EPCC OpenMP 4.0/4.5 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all
More informationParallelization using the GPU
~ Parallelization using the GPU Scientific Computing Winter 2016/2017 Lecture 29 Jürgen Fuhrmann juergen.fuhrmann@wias-berlin.de made wit pandoc 1 / 26 ~ Recap 2 / 26 MIMD Hardware: Distributed memory
More informationOpenMP 4.0. Mark Bull, EPCC
OpenMP 4.0 Mark Bull, EPCC OpenMP 4.0 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all devices!
More informationIntroduction to Standard OpenMP 3.1
Introduction to Standard OpenMP 3.1 Massimiliano Culpo - m.culpo@cineca.it Gian Franco Marras - g.marras@cineca.it CINECA - SuperComputing Applications and Innovation Department 1 / 59 Outline 1 Introduction
More informationTowards Transparent and Efficient GPU Communication on InfiniBand Clusters. Sadaf Alam Jeffrey Poznanovic Kristopher Howard Hussein Nasser El-Harake
Towards Transparent and Efficient GPU Communication on InfiniBand Clusters Sadaf Alam Jeffrey Poznanovic Kristopher Howard Hussein Nasser El-Harake MPI and I/O from GPU vs. CPU Traditional CPU point-of-view
More informationIntroduction to OpenACC. Shaohao Chen Research Computing Services Information Services and Technology Boston University
Introduction to OpenACC Shaohao Chen Research Computing Services Information Services and Technology Boston University Outline Introduction to GPU and OpenACC Basic syntax and the first OpenACC program:
More informationOmpSs-2 Specification
OmpSs-2 Specification Release BSC Programming Models Nov 29, 2018 CONTENTS 1 Introduction 1 1.1 License and liability disclaimer..................................... 1 1.2 A bit of history..............................................
More informationData Parallelism. CSCI 5828: Foundations of Software Engineering Lecture 28 12/01/2016
Data Parallelism CSCI 5828: Foundations of Software Engineering Lecture 28 12/01/2016 1 Goals Cover the material in Chapter 7 of Seven Concurrency Models in Seven Weeks by Paul Butcher Data Parallelism
More informationOpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer
OpenACC Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer 3 WAYS TO ACCELERATE APPLICATIONS Applications Libraries Compiler Directives Programming Languages Easy to use Most Performance
More informationInstructions for setting up OpenCL Lab
Instructions for setting up OpenCL Lab This document describes the procedure to setup OpenCL Lab for Linux and Windows machine. Specifically if you have limited no. of graphics cards and a no. of users
More informationIntroduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines
Introduction to OpenMP Introduction OpenMP basics OpenMP directives, clauses, and library routines What is OpenMP? What does OpenMP stands for? What does OpenMP stands for? Open specifications for Multi
More informationIntroduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research
Introduction to CUDA CME343 / ME339 18 May 2011 James Balfour [ jbalfour@nvidia.com] NVIDIA Research CUDA Programing system for machines with GPUs Programming Language Compilers Runtime Environments Drivers
More informationMPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016
MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 Message passing vs. Shared memory Client Client Client Client send(msg) recv(msg) send(msg) recv(msg) MSG MSG MSG IPC Shared
More informationOpenMP. A parallel language standard that support both data and functional Parallelism on a shared memory system
OpenMP A parallel language standard that support both data and functional Parallelism on a shared memory system Use by system programmers more than application programmers Considered a low level primitives
More informationECE 574 Cluster Computing Lecture 10
ECE 574 Cluster Computing Lecture 10 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 1 October 2015 Announcements Homework #4 will be posted eventually 1 HW#4 Notes How granular
More informationCOMP4510 Introduction to Parallel Computation. Shared Memory and OpenMP. Outline (cont d) Shared Memory and OpenMP
COMP4510 Introduction to Parallel Computation Shared Memory and OpenMP Thanks to Jon Aronsson (UofM HPC consultant) for some of the material in these notes. Outline (cont d) Shared Memory and OpenMP Including
More informationParallel Programming in C with MPI and OpenMP
Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming 1 Outline n OpenMP n Shared-memory model n Parallel for loops n Declaring private variables n Critical
More informationOpenCL in Action. Ofer Rosenberg
pencl in Action fer Rosenberg Working with pencl API pencl Boot Platform Devices Context Queue Platform Query int GetPlatform (cl_platform_id &platform, char* requestedplatformname) { cl_uint numplatforms;
More informationIntroduction to parallel computing concepts and technics
Introduction to parallel computing concepts and technics Paschalis Korosoglou (support@grid.auth.gr) User and Application Support Unit Scientific Computing Center @ AUTH Overview of Parallel computing
More informationEnabling GPU support for the COMPSs-Mobile framework
Enabling GPU support for the COMPSs-Mobile framework Francesc Lordan, Rosa M Badia and Wen-Mei Hwu Nov 13, 2017 4th Workshop on Accelerator Programming Using Directives COMPSs-Mobile infrastructure WAN
More informationOpenCL. Computation on HybriLIT Brief introduction and getting started
OpenCL Computation on HybriLIT Brief introduction and getting started Alexander Ayriyan Laboratory of Information Technologies Joint Institute for Nuclear Research 05.09.2014 (Friday) Tutorial in frame
More informationBarbara Chapman, Gabriele Jost, Ruud van der Pas
Using OpenMP Portable Shared Memory Parallel Programming Barbara Chapman, Gabriele Jost, Ruud van der Pas The MIT Press Cambridge, Massachusetts London, England c 2008 Massachusetts Institute of Technology
More informationDealing with Heterogeneous Multicores
Dealing with Heterogeneous Multicores François Bodin INRIA-UIUC, June 12 th, 2009 Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism
More informationDesign Decisions for a Source-2-Source Compiler
Design Decisions for a Source-2-Source Compiler Roger Ferrer, Sara Royuela, Diego Caballero, Alejandro Duran, Xavier Martorell and Eduard Ayguadé Barcelona Supercomputing Center and Universitat Politècnica
More informationLecture 4: OpenMP Open Multi-Processing
CS 4230: Parallel Programming Lecture 4: OpenMP Open Multi-Processing January 23, 2017 01/23/2017 CS4230 1 Outline OpenMP another approach for thread parallel programming Fork-Join execution model OpenMP
More informationThe OmpSs programming model and its runtime support
www.bsc.es The OmpSs programming model and its runtime support Jesús Labarta BSC RoMoL 2016 Barcelona, March 12 th 2016 1 Vision The multicore and memory revolution ISA leak Plethora of architectures Heterogeneity
More informationOmpSs - programming model for heterogenous and distributed platforms
www.bsc.es OmpSs - programming model for heterogenous and distributed platforms Rosa M Badia Uppsala, 3 June 2013 Evolution of computers All include multicore or GPU/accelerators Parallel programming models
More informationITCS 4/5145 Parallel Computing Test 1 5:00 pm - 6:15 pm, Wednesday February 17, 2016 Solutions Name:...
ITCS 4/5145 Parallel Computing Test 1 5:00 pm - 6:15 pm, Wednesday February 17, 016 Solutions Name:... Answer questions in space provided below questions. Use additional paper if necessary but make sure
More informationHeterogeneous Computing and OpenCL
Heterogeneous Computing and OpenCL Hongsuk Yi (hsyi@kisti.re.kr) (Korea Institute of Science and Technology Information) Contents Overview of the Heterogeneous Computing Introduction to Intel Xeon Phi
More informationStarPU: a runtime system for multigpu multicore machines
StarPU: a runtime system for multigpu multicore machines Raymond Namyst RUNTIME group, INRIA Bordeaux Journées du Groupe Calcul Lyon, November 2010 The RUNTIME Team High Performance Runtime Systems for
More informationNeil Trevett Vice President, NVIDIA OpenCL Chair Khronos President
4 th Annual Neil Trevett Vice President, NVIDIA OpenCL Chair Khronos President Copyright Khronos Group, 2009 - Page 1 CPUs Multiple cores driving performance increases Emerging Intersection GPUs Increasingly
More informationIntroduction to OpenCL!
Lecture 6! Introduction to OpenCL! John Cavazos! Dept of Computer & Information Sciences! University of Delaware! www.cis.udel.edu/~cavazos/cisc879! OpenCL Architecture Defined in four parts Platform Model
More informationIntroduction to OpenACC
Introduction to OpenACC Alexander B. Pacheco User Services Consultant LSU HPC & LONI sys-help@loni.org HPC Training Spring 2014 Louisiana State University Baton Rouge March 26, 2014 Introduction to OpenACC
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna16/ [ 9 ] Shared Memory Performance Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture
More informationIntroduction à OpenCL
1 1 UDS/IRMA Journée GPU Strasbourg, février 2010 Sommaire 1 OpenCL 2 3 GPU architecture A modern Graphics Processing Unit (GPU) is made of: Global memory (typically 1 Gb) Compute units (typically 27)
More informationOpenMP I. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS16/17. HPAC, RWTH Aachen
OpenMP I Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS16/17 OpenMP References Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press,
More informationINTRODUCTION TO OPENACC Lecture 3: Advanced, November 9, 2016
INTRODUCTION TO OPENACC Lecture 3: Advanced, November 9, 2016 Course Objective: Enable you to accelerate your applications with OpenACC. 2 Course Syllabus Oct 26: Analyzing and Parallelizing with OpenACC
More informationObjective. GPU Teaching Kit. OpenACC. To understand the OpenACC programming model. Introduction to OpenACC
GPU Teaching Kit Accelerated Computing OpenACC Introduction to OpenACC Objective To understand the OpenACC programming model basic concepts and pragma types simple examples 2 2 OpenACC The OpenACC Application
More informationOpenACC (Open Accelerators - Introduced in 2012)
OpenACC (Open Accelerators - Introduced in 2012) Open, portable standard for parallel computing (Cray, CAPS, Nvidia and PGI); introduced in 2012; GNU has an incomplete implementation. Uses directives in
More informationNeil Trevett Vice President, NVIDIA OpenCL Chair Khronos President. Copyright Khronos Group, Page 1
Neil Trevett Vice President, NVIDIA OpenCL Chair Khronos President Copyright Khronos Group, 2009 - Page 1 Introduction and aims of OpenCL - Neil Trevett, NVIDIA OpenCL Specification walkthrough - Mike
More informationPragma-based GPU Programming and HMPP Workbench. Scott Grauer-Gray
Pragma-based GPU Programming and HMPP Workbench Scott Grauer-Gray Pragma-based GPU programming Write programs for GPU processing without (directly) using CUDA/OpenCL Place pragmas to drive processing on
More informationPerformance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &
More informationTools for Multi-Cores and Multi-Targets
Tools for Multi-Cores and Multi-Targets Sebastian Pop Advanced Micro Devices, Austin, Texas The Linux Foundation Collaboration Summit April 7, 2011 1 / 22 Sebastian Pop Tools for Multi-Cores and Multi-Targets
More informationGeneral Purpose GPU Programming (1) Advanced Operating Systems Lecture 14
General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14 Lecture Outline Heterogenous multi-core systems and general purpose GPU programming Programming models Heterogenous multi-kernels
More informationCS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra)
CS 470 Spring 2016 Mike Lam, Professor Other Architectures (with an aside on linear algebra) Parallel Systems Shared memory (uniform global address space) Primary story: make faster computers Programming
More informationParallel Programming in C with MPI and OpenMP
Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming 1 Outline n OpenMP n Shared-memory model n Parallel for loops n Declaring private variables n Critical
More informationAddressing Heterogeneity in Manycore Applications
Addressing Heterogeneity in Manycore Applications RTM Simulation Use Case stephane.bihan@caps-entreprise.com Oil&Gas HPC Workshop Rice University, Houston, March 2008 www.caps-entreprise.com Introduction
More information