MultiGPU Made Easy by OmpSs + CUDA/OpenACC


1 MultiGPU Made Easy by OmpSs + CUDA/OpenACC Antonio J. Peña Sr. Researcher & Activity Lead Manager, BSC/UPC NVIDIA GCoE San Jose 2018

2 Introduction: Programming Models for GPU Computing
CUDA (Compute Unified Device Architecture): Runtime & Driver APIs (high-level / low-level). Specific to NVIDIA GPUs: best performance & control.
OpenACC (Open Accelerators): open standard. Higher-level, pragma-based, aiming at portability across heterogeneous hardware. For NVIDIA GPUs, implemented on top of CUDA.
OpenCL (Open Computing Language): open standard. Low-level, similar to the CUDA Driver API. Multi-target, portable*.
(Intentionally leaving out weird stuff like CG, OpenGL, ...) 2
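To make the abstraction gap concrete, here is a minimal sketch (not from the talk) of the same vector-scaling operation written both ways:

// CUDA: explicit kernel, thread indexing, and launch configuration
__global__ void scale_cuda(int n, float a, float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}
// launched as: scale_cuda<<<(n + 127) / 128, 128>>>(n, a, x);

// OpenACC: the compiler derives the kernel and launch from a directive
void scale_acc(int n, float a, float *restrict x)
{
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        x[i] *= a;
}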

3 Motivation: Coding Productivity & Performance
[Chart rating coding productivity vs. performance for: CUDA, OpenACC, OpenACC + CUDA, OmpSs + CUDA, OmpSs + OpenACC, OmpSs + OpenACC + CUDA]
Don't get me wrong: CUDA delivers awesome coding productivity w.r.t., e.g., OpenGL, but I only want to use 3 (easy) colors here. Please interpret the colors as relative to each other. OpenACC may well deliver more than the performance you *need*; however, it offers the least control over performance among the discussed alternatives. 3

4 EPEEC, an EU H2020 Project
European joint Effort toward a Highly Productive Programming Environment for Heterogeneous Exascale Computing. FETHPC, 3 years, ~4M, starting October 2018. Subtopic: high productivity programming environments for exascale. 10 partners; coordinator: BSC (I'm the Technical Manager).
High-level objectives:
- Develop & deploy a production-ready parallel programming environment
- Advance and integrate existing state-of-the-art European technology
- High coding productivity, high performance, energy awareness

5 Proposed Methodology for Application Developers
[Flowchart] Start -> automatic code annotation -> Satisfactory code patterns? If no, update the code patterns and annotate again. If yes: Satisfactory performance? If yes, deploy. If no, profile and ask: Directive optimisation possible? If yes, tune/insert directives; if no, manually code low-level accelerator kernels; then re-check performance. 5

6 OmpSs + CUDA / OpenACC

7 OmpSs
Main program: sequential control flow. Defines a single address space. Executes sequential code that can spawn/instantiate tasks (to be executed sometime in the future) and can stall/wait for tasks.
Tasks: annotated with directionality clauses (in, out, inout). These are used to build dependences among tasks, to let main wait for data to be produced, and, via copy clauses, as the basis for memory management functionalities (replication, locality, movement, ...). 7
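As a minimal sketch of how directionality clauses chain tasks (assuming the Mercurium/Nanos++ OmpSs toolchain; the function names are illustrative, not from the talk):

#include <stdio.h>

// Every invocation of an annotated function becomes a task
#pragma omp task out([n]v)
void init(float *v, int n)
{
    for (int i = 0; i < n; ++i) v[i] = i;    // produces v
}

#pragma omp task inout([n]v)
void scale(float *v, int n, float a)
{
    for (int i = 0; i < n; ++i) v[i] *= a;   // consumes and updates v
}

int main(void)
{
    float v[1024];
    init(v, 1024);         // T1: out(v)
    scale(v, 1024, 2.0f);  // T2: inout(v), so it runs after T1
    #pragma omp taskwait   // main stalls until the data is produced
    printf("%f\n", v[10]);
    return 0;
}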

8 OmpSs: A Sequential Program

void Cholesky( float *A[NT][NT] )
{
   int i, j, k;
   for (k=0; k<NT; k++) {
      spotrf (A[k][k]);
      for (i=k+1; i<NT; i++)
         strsm (A[k][k], A[k][i]);
      for (i=k+1; i<NT; i++) {
         for (j=k+1; j<i; j++)
            sgemm (A[k][i], A[k][j], A[j][i]);
         ssyrk (A[k][i], A[i][i]);
      }
   }
}

[Figure: NT x NT matrix of TS x TS blocks] 8

9 OmpSs: with Directionality Annotations

void Cholesky( float *A[NT][NT] )
{
   int i, j, k;
   for (k=0; k<NT; k++) {
      #pragma omp task inout (A[k][k])
      spotrf (A[k][k]);
      for (i=k+1; i<NT; i++) {
         #pragma omp task in (A[k][k]) inout (A[k][i])
         strsm (A[k][k], A[k][i]);
      }
      for (i=k+1; i<NT; i++) {
         for (j=k+1; j<i; j++) {
            #pragma omp task in (A[k][i], A[k][j]) inout (A[j][i])
            sgemm (A[k][i], A[k][j], A[j][i]);
         }
         #pragma omp task in (A[k][i]) inout (A[i][i])
         ssyrk (A[k][i], A[i][i]);
      }
   }
}

[Figure: NT x NT matrix of TS x TS blocks] 9

10 OmpSs: ... that Happens to Execute in Parallel
(Same annotated code as slide 9.) The runtime builds the task dependency graph from the directionality annotations and executes independent tasks in parallel. Decouple how we write/think (sequential) from how it is executed. 10

11 Memory Consistency (Getting Consistent Copies)
Relaxed-consistency shared-memory model (OpenMP-like)

#pragma omp target device (cuda)
__global__ void scale_task_cuda(double *b, double *c, double a, int N)
{
   int j = blockIdx.x * blockDim.x + threadIdx.x;
   if (j < N) b[j] = a * c[j];
}

#pragma omp target device (smp)
void scale_task_host(double *b, double *c, double a, int N)
{
   for (int j=0; j < N; j++)
      b[j] = a * c[j];
}

int main(int argc, char *argv[])
{
   scale_task_cuda (B, A, 10.0, 1024); //T1
   scale_task_cuda (A, B, 0.01, 1024); //T2
   scale_task_host (C, A, 2.00, 1024); //T3
   #pragma omp taskwait // can access any of A, B, C
}

T1 needs a valid copy of array A on the device. It also allocates array B on the device (no copy needed) and invalidates other copies of B.
[Figure: task dependency graph T1 -> T2 -> T3; A copied host -> device; B allocated on device, no copy needed] 11

12 Memory Consistency (Reusing Data in Place)
Relaxed-consistency shared-memory model (OpenMP-like). Same code as slide 11.
T2 can reuse arrays A and B on the device, since they have already been used by the previous task (T1). Additionally, it invalidates other copies of A.
[Figure: task dependency graph T1 -> T2 -> T3; no memory transfers needed] 12

13 Memory Consistency (On-Demand Copy of Data Back)
Relaxed-consistency shared-memory model (OpenMP-like). Same code as slide 11.
T3 needs array A copied back to the host. It does not invalidate the existing copy on the device.
[Figure: task dependency graph T1 -> T2 -> T3; A copied device -> host for T3] 13

14 Memory Consistency (Centralized Memory Consistency)
Relaxed-consistency shared-memory model (OpenMP-like). Same code as slide 11.
The taskwait requires full memory consistency in the host, so all modified data is transferred back.
[Figure: execution timeline T1, T2, T3, TW; task dependency graph; memory transfers] 14

15 Memory Consistency (Avoid taskwait Consistency)
Relaxed-consistency shared-memory model (OpenMP-like). Same kernels as slide 11; the main program adds a noflush taskwait and a fourth task:

int main(int argc, char *argv[])
{
   scale_task_cuda (B, A, 10.0, 1024); //T1
   scale_task_cuda (A, B, 0.01, 1024); //T2
   scale_task_host (C, A, 2.00, 1024); //T3
   #pragma omp taskwait noflush // does not flush data dev -> host
   scale_task_cuda (B, C, 3.00, 1024); //T4
   #pragma omp taskwait // can access any of A, B, C
}

The taskwait noflush waits for task finalization but does not copy memory back to the host (nor invalidate it).
[Figure: task dependency graph T1, T2, T3, T4; memory transfers; noflush] 15

16 Memory Consistency (Avoid taskwait Consistency)
Same code as slide 15.
Before executing T4, the runtime needs a consistent copy of C on the device, and it also invalidates all previous versions of B.
[Figure: task dependency graph T1, T2, T3, T4; C copied host -> device for T4; noflush] 16

17 Memory Consistency (Avoid taskwait Consistency)
Same code as slide 15.
The final taskwait waits for task finalization, invalidates all device data versions, and forces memory consistency on the host.
[Figure: execution timeline T1, T2, T3, T4, TW; task dependency graph; memory transfers] 17
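Pulling the preceding slides together, here is a consolidated, compilable sketch of the whole example. The slides omit the task pragmas and the declarations of A, B, and C, so the copy_deps, ndrange, and dependence clauses below are assumptions, written in the style of the SAXPY example that follows:

#include <stdio.h>
#define N 1024

// Kernel body as on the slides, compiled from a separate .cu file
#pragma omp target device (cuda) copy_deps ndrange(1, n, 128)
#pragma omp task in([n]c) out([n]b)
__global__ void scale_task_cuda(double *b, double *c, double a, int n);

#pragma omp target device (smp) copy_deps
#pragma omp task in([n]c) out([n]b)
void scale_task_host(double *b, double *c, double a, int n)
{
    for (int j = 0; j < n; j++) b[j] = a * c[j];
}

double A[N], B[N], C[N];

int main(int argc, char *argv[])
{
    for (int j = 0; j < N; j++) A[j] = j;  // A starts valid on the host
    scale_task_cuda (B, A, 10.0, N);  // T1: copies A in, allocates B
    scale_task_cuda (A, B, 0.01, N);  // T2: reuses A and B in place
    scale_task_host (C, A, 2.00, N);  // T3: copies A back on demand
    #pragma omp taskwait noflush      // wait, but keep device data valid
    scale_task_cuda (B, C, 3.00, N);  // T4: copies C to the device
    #pragma omp taskwait              // full host consistency
    printf("%f %f %f\n", A[0], B[0], C[0]);
    return 0;
}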

18 OmpSs + CUDA Example: SAXPY Algorithm
Steps: 1 Port the kernel to CUDA; 2 Annotate it as device (cuda); 3 Complete with a device (smp) version.

main.c:
#include <kernel.h>
int main(int argc, char *argv[])
{
   float a=5, x[N], y[N];
   // Initialize values
   for (int i=0; i<N; ++i)
      x[i] = y[i] = i;
   // Compute saxpy algorithm (1 task)
   saxpy(N, a, x, y);
   #pragma omp taskwait
   // Check results
   for (int i=0; i<N; ++i)
      if (y[i] != a*i + i) perror("Error\n");
   message("Results are correct\n");
}

kernel.h / kernel.c (SMP version):
#pragma omp target device(smp) copy_deps
#pragma omp task in([n]x) inout([n]y)
void saxpy(int n, float a, float* x, float* y);

void saxpy(int n, float a, float *x, float *y)
{
   for (int i=0; i<n; ++i)
      y[i] = x[i] * a + y[i];
}

kernel.cuh / kernel.cu (CUDA version):
#pragma omp target device(cuda) copy_deps ndrange(1,n,128)
#pragma omp task in([n]x) inout([n]y)
__global__ void saxpy(int n, float a, float* x, float* y);

__global__ void saxpy(int n, float a, float* x, float* y)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < n) y[i] = a * x[i] + y[i];
}

The ndrange(1,n,128) clause gives the runtime the kernel launch configuration: a 1-D range of n elements with 128 threads per block. The annotations (steps 2-3) are so easy! But porting the kernel to CUDA (step 1) is difficult for non-experienced programmers! 18

19 OmpSs + OpenACC: Motivation
What if we could use OpenACC directives with OmpSs? OpenACC is way easier than CUDA. Instead of porting & optimizing many CUDA tasks, port every GPU-accelerated task using OpenACC and only use CUDA where the OpenACC compiler doesn't provide the required efficiency. 19

20 OmpSs + OpenACC Example: SAXPY Algorithm
Steps: 1 Port the kernel to OpenACC; 2 Annotate it as device (openacc); 3 Complete with a device (smp) version.

main.c: (same as in the CUDA example)

kernel.h / kernel.c (SMP version):
#pragma omp target device(smp) copy_deps
#pragma omp task in([n]x) inout([n]y)
void saxpy (int n, float a, float* x, float* y);

void saxpy(int n, float a, float *x, float *y)
{
   for (int i=0; i<n; ++i)
      y[i] = x[i] * a + y[i];
}

kernel.h / kernel.c (OpenACC version):
#pragma omp target device(openacc) copy_deps
#pragma omp task in([n]x) inout([n]y)
void saxpy (int n, float a, float* x, float* y);

void saxpy(int n, float a, float *x, float *y)
{
   #pragma acc kernels
   for (int i=0; i<n; ++i)
      y[i] = x[i] * a + y[i];
}

So easy! So easy! So easy! 20

21 FWI: Full Waveform Inversion
Oil & gas miniapplication: analyzes physical properties of the subsoil from seismic measures. Elastic wave propagator + linearly elastic stress-strain relationships; six different stress components. Finite differences (FD) method with a fully staggered grid (FSG). Base code developed by the BSC Repsol Team. 21
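To give a flavor of how such a propagator maps onto the model (a hypothetical, heavily simplified sketch; FWI's real kernels update six stress components on a 3-D fully staggered grid):

// Hypothetical 1-D central-difference update written as an
// OmpSs task targeting OpenACC, following the SAXPY recipe above
#pragma omp target device(openacc) copy_deps
#pragma omp task in([n]v) out([n]s)
void fd_update(int n, float dt, float *v, float *s)
{
    #pragma acc kernels
    for (int i = 0; i < n; ++i)
        s[i] = (i > 0 && i < n - 1)
             ? dt * 0.5f * (v[i + 1] - v[i - 1])  // central difference
             : 0.0f;                              // simple boundary
}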

22 3,27 0,97 1,13 1,29 Speedup 5,68 5,75 5,84 7,18 6,25 12,15 14,96 13,44 15,95 13,29 16,52 13,46 16,57 12,37 17,47 19,08 18,18 FWI Parallelization OmpSs/OpenCC - Results FWI Speedups aseline: OpenMP 25,00 20,00 15,00 10,00 5,00 0,00 1,00 1,00 i7-5930k (6c) Tesla K40 (Kepler) Titan X (Maxwell) Titan X (Pascal) 22

23 Some Announcements
L8116 Best GPU Code Practices Combining OpenACC, CUDA, and OmpSs. Thu. 10:00-12:00
S8328 One More Step Towards the Simulation of the Human Brain on NVIDIA GPUs (HBP). Thu. 4:00-4:25pm
Join the upcoming EPEEC's Users Group
STARS Open Postdoctoral Fellowships
PUMPS+AI Summer School, Barcelona, July. Featuring Wen-mei Hwu & David Kirk. Advanced CUDA + brand-new +AI format!
antonio.pena@bsc.es

24 Acknowledgements
Guray Ozen: first OmpSs+OpenACC prototype.
Accelerators and Communications for HPC Team (my team). Core: Pau Farré, Marc Jordà, Kyunghun Kim, Mohammad Owais. Collaborators: Pedro Valero, Aimar Rodríguez, Jan Ciesko.
OmpSs Team: awesome programming model and runtime. Xavier Martorell, Vicenç Beltran, Xavier Teruel, Sergi Mateo, JM Perez, ...
BSC Repsol Team: providing the original FWI implementation. Maurizio Hanzich, Samuel Rodríguez, ...
PUMPS Summer School: Wen-mei Hwu & TAs (Simón García de Gonzalo, Abdul Dakkak, Carl Pearson, Mert Hidayetoglu), David Kirk, Juan Gómez-Luna.
Pau Farré, Jr. Research Engineer 24

25 Thank you! For further information please contact antonio.pena@bsc.es
