MultiGPU Made Easy by OmpSs + CUDA/OpenACC
1 MultiGPU Made Easy by OmpSs + CUDA/OpenACC. Antonio J. Peña, Sr. Researcher & Activity Lead Manager, BSC/UPC NVIDIA GCoE. San Jose, 2018
2 Introduction: Programming Models for GPU Computing

CUDA (Compute Unified Device Architecture): Runtime & Driver APIs (high-level / low-level). Specific to NVIDIA GPUs: best performance & control.
OpenACC (Open Accelerators): open standard. Higher-level, pragma-based, aiming at portability across heterogeneous hardware. For NVIDIA GPUs, implemented on top of CUDA.
OpenCL (Open Computing Language): open standard. Low-level, similar to the CUDA Driver API. Multi-target, portable*.
(Intentionally leaving out weird stuff like Cg, OpenGL, ...)
3 Motivation: Coding Productivity & Performance

(Chart: coding productivity vs. performance for CUDA, OpenACC, OpenACC + CUDA, OmpSs + CUDA, OmpSs + OpenACC, and OmpSs + OpenACC + CUDA.)
Don't get me wrong: CUDA delivers awesome coding productivity w.r.t., e.g., OpenGL, but I only want to use 3 (easy) colors here. Please interpret the colors as relative to each other.
OpenACC may well deliver more than the performance you *need*. However, it gives the lowest control over performance w.r.t. the discussed alternatives.
4 EPEEC, an EU H2020 Project

European joint Effort toward a Highly Productive Programming Environment for Heterogeneous Exascale Computing.
FETHPC, 3 years, ~4M, starting October 2018. Subtopic: high productivity programming environments for exascale.
10 partners; coordinator: BSC (I'm the Technical Manager).
High-level objectives: develop & deploy a production-ready parallel programming environment; advance and integrate existing state-of-the-art European technology; high coding productivity, high performance, energy awareness.
5 Proposed Methodology for Application Developers

(Flowchart)
Start -> Automatic Code Annotation -> Satisfactory Code Patterns? If no: update code patterns and re-annotate.
Then: Satisfactory Performance? If yes: Deploy.
If no: Profile -> Directive Optimisation Possible? If yes: tune/insert directives manually; if no: code low-level accelerator kernels. Re-check performance until satisfactory.
6 OmpSs + CUDA / OpenACC
7 OmpSs Main Program

Sequential control flow: defines a single address space and executes sequential code that can spawn/instantiate tasks (to be executed sometime in the future) and can stall/wait for tasks.
Tasks are annotated with directionality clauses: in, out, inout. These are used to build dependences among tasks, to let main wait for data to be produced, and as the basis for memory management functionalities (replication, locality, movement, ...): copy clauses.
8 OmpSs: Sequential Program

    void Cholesky( float *A[NT][NT] )
    {
       int i, j, k;
       for (k=0; k<NT; k++) {
          spotrf (A[k][k]);
          for (i=k+1; i<NT; i++)
             strsm (A[k][k], A[k][i]);
          for (i=k+1; i<NT; i++) {
             for (j=k+1; j<i; j++)
                sgemm (A[k][i], A[k][j], A[j][i]);
             ssyrk (A[k][i], A[i][i]);
          }
       }
    }

(Figure: an NT x NT matrix of TS x TS tiles.)
9 OmpSs: with Directionality Annotations

    void Cholesky( float *A[NT][NT] )
    {
       int i, j, k;
       for (k=0; k<NT; k++) {
          #pragma omp task inout (A[k][k])
          spotrf (A[k][k]);
          for (i=k+1; i<NT; i++) {
             #pragma omp task in (A[k][k]) inout (A[k][i])
             strsm (A[k][k], A[k][i]);
          }
          for (i=k+1; i<NT; i++) {
             for (j=k+1; j<i; j++) {
                #pragma omp task in (A[k][i], A[k][j]) inout (A[j][i])
                sgemm (A[k][i], A[k][j], A[j][i]);
             }
             #pragma omp task in (A[k][i]) inout (A[i][i])
             ssyrk (A[k][i], A[i][i]);
          }
       }
    }

(Figure: an NT x NT matrix of TS x TS tiles.)
10 OmpSs: ...that Happens to Execute in Parallel

(Same annotated code as slide 9.) Decouple how we write/think (sequential) from how it is executed.
11 Memory Consistency (Getting Consistent Copies)

Relaxed-consistency shared-memory model (OpenMP-like):

    #pragma omp target device (cuda)
    void scale_task_cuda(double *b, double *c, double a, int N)
    {
       int j = blockIdx.x * blockDim.x + threadIdx.x;
       if (j < N) b[j] = a * c[j];
    }

    #pragma omp target device (smp)
    void scale_task_host(double *b, double *c, double a, int N)
    {
       for (int j=0; j < N; j++) b[j] = a * c[j];
    }

    void main(int argc, char *argv[])
    {
       scale_task_cuda (B, A, 10.0, 1024); // T1
       scale_task_cuda (A, B, 0.01, 1024); // T2
       scale_task_host (C, A, 2.00, 1024); // T3
       #pragma omp taskwait // can access any of A, B, C
    }

T1 needs a valid copy of array A in the device. It also allocates array B in the device (no copy needed), and invalidates the other copies of B. (Diagram: task dependency graph T1 -> T2, T1 -> T3; memory transfer of A from host to device; no need to copy B.)
12 Memory Consistency (Reusing Data in Place)

Relaxed-consistency shared-memory model (OpenMP-like). (Same code as slide 11.)
T2 can reuse arrays A and B on the device, since they have already been used by the previous task (T1). Additionally, it invalidates the other copies of A.
13 Memory Consistency (On-Demand Copy Data Back)

Relaxed-consistency shared-memory model (OpenMP-like). (Same code as slide 11.)
T3, a host task, needs array A copied back to the host. It does not invalidate the existing copy in the device.
14 Memory Consistency (Centralized Memory Consistency)

Relaxed-consistency shared-memory model (OpenMP-like). (Same code as slide 11.)
The taskwait requires full memory consistency in the host: modified data is transferred back. (Diagram: timeline T1, T2, T3, TW.)
15 Memory Consistency (Avoid taskwait Consistency)

Relaxed-consistency shared-memory model (OpenMP-like). Same scale_task_cuda / scale_task_host definitions as slide 11, with an extended main:

    void main(int argc, char *argv[])
    {
       scale_task_cuda (B, A, 10.0, 1024); // T1
       scale_task_cuda (A, B, 0.01, 1024); // T2
       scale_task_host (C, A, 2.00, 1024); // T3
       #pragma omp taskwait noflush // does not flush data dev -> host
       scale_task_cuda (B, C, 3.00, 1024); // T4
       #pragma omp taskwait // can access any of A, B, C
    }

The noflush taskwait waits for task finalization, but neither copies memory back to the host nor invalidates it.
16 Memory Consistency (Avoid taskwait Consistency)

Relaxed-consistency shared-memory model (OpenMP-like). (Same code as slide 15.)
Before executing T4, the runtime needs a consistent copy of C in the device, and it invalidates all previous versions of B (the array T4 writes).
17 Memory Consistency (Avoid taskwait Consistency)

Relaxed-consistency shared-memory model (OpenMP-like). (Same code as slide 15.)
The final taskwait waits for task finalization; it invalidates all device data versions and forces memory consistency in the host.
18 OmpSs + CUDA Example: SAXPY Algorithm

Steps: (1) port the kernel to CUDA; (2) annotate it with device (cuda); (3) complete the device (smp) version.

main.c:

    #include <kernel.h>
    int main(int argc, char *argv[])
    {
       float a=5, x[N], y[N];
       // Initialize values
       for (int i=0; i<N; ++i)
          x[i] = y[i] = i;
       // Compute saxpy algorithm (1 task)
       saxpy(N, a, x, y);
       #pragma omp taskwait
       // Check results
       for (int i=0; i<N; ++i)
          if (y[i] != a*i+i) perror("Error\n");
       message("Results are correct\n");
    }

kernel.h / kernel.c (smp version):

    #pragma omp target device(smp) copy_deps
    #pragma omp task in([n]x) inout([n]y)
    void saxpy(int n, float a, float* x, float* y);

    void saxpy(int n, float a, float *x, float *y)
    {
       for (int i=0; i<n; ++i)
          y[i] = x[i] * a + y[i];
    }

kernel.cuh / kernel.cu (cuda version):

    #pragma omp target device(cuda) copy_deps ndrange(1,n,128)
    #pragma omp task in([n]x) inout([n]y)
    __global__ void saxpy(int n, float a, float* x, float* y);

    __global__ void saxpy(int n, float a, float* x, float* y)
    {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n)
          y[i] = a * x[i] + y[i];
    }

So easy! (Though writing the CUDA kernel itself is difficult for non-experienced programmers!)
19 OmpSs + OpenACC: Motivation

What if we could use OpenACC directives with OmpSs? OpenACC is way easier than CUDA. Instead of porting & optimizing many CUDA tasks, port every GPU-accelerated task using OpenACC, and use CUDA only where the OpenACC compiler doesn't provide the required efficiency.
20 OmpSs + OpenACC Example: SAXPY Algorithm

Steps: (1) port the kernel to OpenACC; (2) annotate it with device (openacc); (3) complete the device (smp) version. (main.c is the same as in the CUDA example.)

kernel.h / kernel.c (smp version):

    #pragma omp target device(smp) copy_deps
    #pragma omp task in([n]x) inout([n]y)
    void saxpy (int n, float a, float* x, float* y);

    void saxpy(int n, float a, float *x, float *y)
    {
       for (int i=0; i<n; ++i)
          y[i] = x[i] * a + y[i];
    }

kernel.h / kernel.c (openacc version):

    #pragma omp target device(openacc) copy_deps
    #pragma omp task in([n]x) inout([n]y)
    void saxpy (int n, float a, float* x, float* y);

    void saxpy(int n, float a, float *x, float *y)
    {
       #pragma acc kernels
       for (int i=0; i<n; ++i)
          y[i] = x[i] * a + y[i];
    }

So easy!
21 FWI: Full Wave Inversion

Oil & Gas mini-application. It analyzes physical properties of the subsoil from seismic measurements: an elastic wave propagator plus linearly elastic stress-strain relationships, with six different stress components. Finite differences (FD) method with a Fully Staggered Grid (FSG). Base code developed by the BSC Repsol Team.
22 FWI Parallelization OmpSs/OpenACC: Results

(Chart: FWI speedups, baseline OpenMP on an i7-5930K (6c) = 1.00, for Tesla K40 (Kepler), Titan X (Maxwell) and Titan X (Pascal); measured speedups range from about 1x up to about 19x.)
23 Some Announcements

L8116 Best GPU Code Practices Combining OpenACC, CUDA, and OmpSs. Thu. 10:00-12:00
S8328 One More Step Towards the Simulation of the Human Brain on NVIDIA GPUs (HBP). Thu. 4:00-4:25pm
Join the upcoming EPEEC Users Group.
STARS Open Postdoctoral Fellowships.
PUMPS+AI Summer School, Barcelona, July. Featuring Wen-mei Hwu & David Kirk. Advanced CUDA + brand-new AI format!
antonio.pena@bsc.es
24 Acknowledgements

Guray Ozen: first OmpSs+OpenACC prototype.
Accelerators and Communications for HPC Team (my team). Core: Pau Farré, Marc Jordà, Kyunghun Kim, Mohammad Owais. Collaborators: Pedro Valero, Aimar Rodríguez, Jan Ciesko.
OmpSs Team: awesome programming model and runtime. Xavier Martorell, Vicenç Beltran, Xavier Teruel, Sergi Mateo, JM Perez.
BSC Repsol Team: provided the original FWI implementation. Maurizio Hanzich, Samuel Rodríguez.
PUMPS Summer School: Wen-mei Hwu & TAs: Simón García de Gonzalo, Abdul Dakkak, Carl Pearson, Mert Hidayetoglu; David Kirk, Juan Gómez-Luna.
Pau Farré, Jr. Research Engineer
25 Thank you! For further information please contact antonio.pena@bsc.es
More informationGPU CUDA Programming
GPU CUDA Programming 이정근 (Jeong-Gun Lee) 한림대학교컴퓨터공학과, 임베디드 SoC 연구실 www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction 차례 Multicore/Manycore and GPU GPU on Medical Applications
More informationA case study of performance portability with OpenMP 4.5
A case study of performance portability with OpenMP 4.5 Rahul Gayatri, Charlene Yang, Thorsten Kurth, Jack Deslippe NERSC pre-print copy 1 Outline General Plasmon Pole (GPP) application from BerkeleyGW
More informationOpenACC. Part 2. Ned Nedialkov. McMaster University Canada. CS/SE 4F03 March 2016
OpenACC. Part 2 Ned Nedialkov McMaster University Canada CS/SE 4F03 March 2016 Outline parallel construct Gang loop Worker loop Vector loop kernels construct kernels vs. parallel Data directives c 2013
More informationParallel Hybrid Computing F. Bodin, CAPS Entreprise
Parallel Hybrid Computing F. Bodin, CAPS Entreprise Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism Various heterogeneous
More informationLecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1
Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei
More informationGPU COMPUTING AND THE FUTURE OF HPC. Timothy Lanfear, NVIDIA
GPU COMPUTING AND THE FUTURE OF HPC Timothy Lanfear, NVIDIA ~1 W ~3 W ~100 W ~30 W 1 kw 100 kw 20 MW Power-constrained Computers 2 EXASCALE COMPUTING WILL ENABLE TRANSFORMATIONAL SCIENCE RESULTS First-principles
More informationUnified Memory. Notes on GPU Data Transfers. Andreas Herten, Forschungszentrum Jülich, 24 April Member of the Helmholtz Association
Unified Memory Notes on GPU Data Transfers Andreas Herten, Forschungszentrum Jülich, 24 April 2017 Handout Version Overview, Outline Overview Unified Memory enables easy access to GPU development But some
More informationModule 2: Introduction to CUDA C. Objective
ECE 8823A GPU Architectures Module 2: Introduction to CUDA C 1 Objective To understand the major elements of a CUDA program Introduce the basic constructs of the programming model Illustrate the preceding
More informationQuad Doubles on a GPU
Quad Doubles on a GPU 1 Floating-Point Arithmetic floating-point numbers quad double arithmetic quad doubles for use in CUDA programs 2 Quad Double Square Roots quad double arithmetic on a GPU a kernel
More informationCUDA. Matthew Joyner, Jeremy Williams
CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel
More informationDirective-based Programming for Highly-scalable Nodes
Directive-based Programming for Highly-scalable Nodes Doug Miles Michael Wolfe PGI Compilers & Tools NVIDIA Cray User Group Meeting May 2016 Talk Outline Increasingly Parallel Nodes Exposing Parallelism
More informationOpenMP Tutorial. Dirk Schmidl. IT Center, RWTH Aachen University. Member of the HPC Group Christian Terboven
OpenMP Tutorial Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group schmidl@itc.rwth-aachen.de IT Center, RWTH Aachen University Head of the HPC Group terboven@itc.rwth-aachen.de 1 Tasking
More informationGPU Programming with Ateji PX June 8 th Ateji All rights reserved.
GPU Programming with Ateji PX June 8 th 2010 Ateji All rights reserved. Goals Write once, run everywhere, even on a GPU Target heterogeneous architectures from Java GPU accelerators OpenCL standard Get
More informationParallel Programming. OpenMP Parallel programming for multiprocessors for loops
Parallel Programming OpenMP Parallel programming for multiprocessors for loops OpenMP OpenMP An application programming interface (API) for parallel programming on multiprocessors Assumes shared memory
More informationUsing a GPU in InSAR processing to improve performance
Using a GPU in InSAR processing to improve performance Rob Mellors, ALOS PI 152 San Diego State University David Sandwell University of California, San Diego What is a GPU? (Graphic Processor Unit) A graphics
More informationOpenMP and GPU Programming
OpenMP and GPU Programming GPU Intro Emanuele Ruffaldi https://github.com/eruffaldi/course_openmpgpu PERCeptual RObotics Laboratory, TeCIP Scuola Superiore Sant Anna Pisa,Italy e.ruffaldi@sssup.it April
More informationIntroduction to GPU programming. Introduction to GPU programming p. 1/17
Introduction to GPU programming Introduction to GPU programming p. 1/17 Introduction to GPU programming p. 2/17 Overview GPUs & computing Principles of CUDA programming One good reference: David B. Kirk
More informationParallel Programming. Libraries and implementations
Parallel Programming Libraries and implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationEarly Experiences With The OpenMP Accelerator Model
Early Experiences With The OpenMP Accelerator Model Chunhua Liao 1, Yonghong Yan 2, Bronis R. de Supinski 1, Daniel J. Quinlan 1 and Barbara Chapman 2 1 Center for Applied Scientific Computing, Lawrence
More informationOpenACC Course Lecture 1: Introduction to OpenACC September 2015
OpenACC Course Lecture 1: Introduction to OpenACC September 2015 Course Objective: Enable you to accelerate your applications with OpenACC. 2 Oct 1: Introduction to OpenACC Oct 6: Office Hours Oct 15:
More informationLecture 4: OpenMP Open Multi-Processing
CS 4230: Parallel Programming Lecture 4: OpenMP Open Multi-Processing January 23, 2017 01/23/2017 CS4230 1 Outline OpenMP another approach for thread parallel programming Fork-Join execution model OpenMP
More informationProgramming in CUDA. Malik M Khan
Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement
More informationOpenMP Tasking Model Unstructured parallelism
www.bsc.es OpenMP Tasking Model Unstructured parallelism Xavier Teruel and Xavier Martorell What is a task in OpenMP? Tasks are work units whose execution may be deferred or it can be executed immediately!!!
More informationHeterogeneous Multicore Parallel Programming
Innovative software for manycore paradigms Heterogeneous Multicore Parallel Programming S. Chauveau & L. Morin & F. Bodin Introduction Numerous legacy applications can benefit from GPU computing Many programming
More informationOpenMP API Version 5.0
OpenMP API Version 5.0 (or: Pretty Cool & New OpenMP Stuff) Michael Klemm Chief Executive Officer OpenMP Architecture Review Board michael.klemm@openmp.org Architecture Review Board The mission of the
More informationAn innovative compilation tool-chain for embedded multi-core architectures M. Torquati, Computer Science Departmente, Univ.
An innovative compilation tool-chain for embedded multi-core architectures M. Torquati, Computer Science Departmente, Univ. Of Pisa Italy 29/02/2012, Nuremberg, Germany ARTEMIS ARTEMIS Joint Joint Undertaking
More informationSPOC : GPGPU programming through Stream Processing with OCaml
SPOC : GPGPU programming through Stream Processing with OCaml Mathias Bourgoin - Emmanuel Chailloux - Jean-Luc Lamotte January 23rd, 2012 GPGPU Programming Two main frameworks Cuda OpenCL Different Languages
More informationIntroduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research
Introduction to CUDA CME343 / ME339 18 May 2011 James Balfour [ jbalfour@nvidia.com] NVIDIA Research CUDA Programing system for machines with GPUs Programming Language Compilers Runtime Environments Drivers
More information