GPU Programming Paradigms


1 GPU Programming with PGI CUDA Fortran and the PGI Accelerator Programming Model
Boris Bierbaum, Sandra Wienke

2 Current:
    linuxc7: CentOS 5.3, Nvidia GeForce GT 220
    hpc-denver: Windows 7, Nvidia GeForce GT 220
    hpc-orlando: Windows 7, ATI Radeon 4650
    Intel GHz, 8 GiB RAM
Upcoming:
    Tesla S1070 1U Rack Box: 4 Tesla T10 GPUs, 16 GiB RAM
    Tesla S20x0 1U Rack Box: 4 Fermi GPUs, 12/24 GiB RAM
    Connected to Nehalem Servers: 2 x Intel X5570@2.93 GHz
Future:
    Visualization Cluster powering the new CAVE, Fermi graphics
    Available for HPC in batch mode at night/weekends

3 CUBLAS & CUFFT
BLAS (Basic Linear Algebra Subprograms) and FFT (Fast Fourier Transform): popular interfaces / algorithms for numerical computation
CUBLAS / CUFFT offload the computation onto the GPU, but are available as headers + libraries and are used like ordinary libraries
Potential to offload computation and harness GPU power without having to restructure your program and re-code your algorithms in a painful way
Nvidia documentation available for both libraries
If you make heavy use of BLAS or FFT => try it out!

4 PGI: Overview
Portland Group:
    Compiler: pgcc, pgfortran
    GPU activities:
        CUDA Fortran
        PGI Accelerator Programming Model
    Works only on Nvidia GPUs
    Commercial compiler product (license needed)
    Available on our Linux cluster in several versions, including the most recent (10.3)

5 PGI CUDA FORTRAN

6 Overview
CUDA is an architecture and a programming model, accessible to C programmers via C for CUDA, a C extension
Likewise: CUDA Fortran makes the CUDA model accessible to Fortran programmers
    Same level of abstraction
    Compiler support necessary: CUDA Fortran is an extension to Fortran
    Developed by The Portland Group and only available with the PGI compilers

7 Example: Overall Structure
    module calcpi_mod
      use cudafor
    contains
      real attributes(device) function f(a)
        [...]
      end function f

      attributes(global) subroutine partialsum (n, nbrthreads, res)
        [...]
      end subroutine partialsum

      real function calcpi (n, blockspergrid, threadsperblock)
        [...]
      end function calcpi
    end module calcpi_mod

    program Pi
      use calcpi_mod
      [...]

8 Example: Kernel code
    attributes(global) subroutine partialsum (n, nbrthreads, res)
      implicit none
      integer, intent(in), value :: n, nbrthreads
      real, intent(out), device :: res(nbrthreads)
      real :: fh, fsum, fx
      integer :: i, idx

      fh = 1.0 / real(n)
      fsum = 0.0
      ! Global thread index (1-based)
      idx = blockdim%x * (blockidx%x - 1) + threadidx%x
      ! Each thread sums every nbrthreads-th interval
      do i = idx - 1, n - 1, nbrthreads
        fx = fh * (real(i) + 0.5)
        fsum = fsum + f(fx)
      end do
      res(idx) = fh * fsum
    end subroutine partialsum
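The strided work distribution of the kernel can be checked on the host without a GPU. The following is a minimal sketch in standard Fortran that runs the "threads" one after another. The module name `calcpi_emul_mod`, the helper `global_index`, the final host-side `sum(res)`, and the integrand 4/(1+x^2) are assumptions for illustration (the slides elide the bodies of `f` and `calcpi`); double precision is used only to make the check robust.

```fortran
module calcpi_emul_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Assumed integrand: the integral of 4/(1+x^2) over [0,1] equals pi.
  ! The slides do not show the body of f, so this is a guess.
  pure function f(a) result(fa)
    real(dp), intent(in) :: a
    real(dp) :: fa
    fa = 4.0_dp / (1.0_dp + a * a)
  end function f

  ! Global 1-based thread index, as computed in the kernel from
  ! blockdim%x, blockidx%x and threadidx%x.
  pure function global_index(blockdim_x, blockidx_x, threadidx_x) result(idx)
    integer, intent(in) :: blockdim_x, blockidx_x, threadidx_x
    integer :: idx
    idx = blockdim_x * (blockidx_x - 1) + threadidx_x
  end function global_index

  ! Serial emulation of partialsum, followed by a host-side sum of res.
  function calcpi_emulated(n, blockspergrid, threadsperblock) result(pi_est)
    integer, intent(in) :: n, blockspergrid, threadsperblock
    real(dp) :: pi_est
    real(dp), allocatable :: res(:)
    real(dp) :: fh, fsum
    integer :: nbrthreads, iblock, ithread, idx, i

    nbrthreads = blockspergrid * threadsperblock
    allocate(res(nbrthreads))
    fh = 1.0_dp / real(n, dp)

    do iblock = 1, blockspergrid
      do ithread = 1, threadsperblock
        idx = global_index(threadsperblock, iblock, ithread)
        fsum = 0.0_dp
        ! Thread idx handles intervals idx-1, idx-1+nbrthreads, ...
        do i = idx - 1, n - 1, nbrthreads
          fsum = fsum + f(fh * (real(i, dp) + 0.5_dp))
        end do
        res(idx) = fh * fsum
      end do
    end do

    pi_est = sum(res)
  end function calcpi_emulated

end module calcpi_emul_mod
```

Because every interval index from 0 to n-1 is visited by exactly one emulated thread, the result should not depend on how the threads are split into blocks.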

9 Example: Kernel call
    [...]
    real, allocatable, dimension(:), device :: result_dev
    [...]
    allocate( result_dev(nbrthreads), stat = status )
    [...]
    call partialsum<<<blockspergrid, threadsperblock>>>(n, nbrthreads, result_dev)
    r = cudathreadsynchronize()
    ierr = cudagetlasterror()
    if (ierr /= 0) then
      write (*,"(A, ' ', A)") 'CUDA Error: ', cudageterrorstring (ierr)
      stop
    end if
    [...]
    deallocate( result_dev )

10 Subprogram Qualifiers
Host subprogram, attributes(host):
    function or subroutine
    can only be called from another host subprogram
    default attribute
Device subprogram, kernel subroutine, attributes(global):
    only subroutine
    may only be called from host subprogram using chevron call syntax
Device subprogram, attributes(device):
    function or subroutine
    only callable from device subprogram
Certain restrictions apply to device subprograms:
    device subprograms not recursive
    no assumed-shape arrays as dummy arguments

11 Variable Qualifiers
Device global memory (device):
    use modules to share between host and device subprograms
Constant memory space (constant):
    use modules to share between host and device subprograms
    may be modified only in host subprograms
    may not be allocatable
Device shared memory (shared):
    only in device subprogram
    shared by all threads in a block
Page-locked memory on host (pinned):
    must be an allocatable array

12 Fortran Specifics
Attributed variable declarations: device, constant, shared, pinned; use allocate for dynamic memory allocation
(Implicit) data transfer:
    A = Adev or Adev = A or Adev = Bdev
    B = A + Adev
    C = A * Adev + B
    Computation done on host
Fortran intrinsics in device subprograms:
    A limited number of standard intrinsics available
    New intrinsics: syncthreads, gpu_time, ...

13 Build and Run CUDA Fortran Code
Use the PGI Compiler: module switch intel pgi
Filename suffix: .cuf or .CUF
Explicitly enable CUDA Fortran: -Mcuda
Emulation mode: -Mcuda=emulate

14 PGI ACCELERATOR PROGRAMMING MODEL

15 Overview
Usable for C and Fortran
Directives like OpenMP
    C: #pragma acc <directive-name> [<clause>]
    Fortran: !$acc <directive-name> [<clause>]

OpenMP:
    !$omp parallel do private(tmp)
    do i = 1, n
      tmp = 2.0 * x(i)
      y(i) = tmp * tmp
    end do

PGI Accelerator:
    !$acc region do
    do i = 1, n
      tmp = 2.0 * x(i)
      y(i) = tmp * tmp
    end do

16 Getting Started
Test connection to GPU:
    pgaccelinfo
Compilation:
    pgfortran -ta=nvidia -Minfo=accel a.f90
    -ta=nvidia: build for (Nvidia) GPU
    -Minfo=accel: enable compiler feedback
Compilation generates Host x86 Code & GPU/Accelerator Code
Print message when kernel is executed:
    export ACC_NOTIFY=1

17 Excerpt of Jacobi Example
    do while (iitercount < iitermax .and. residual > ftolerance)
      residual = 0.0d0
      ! Copy new solution into old
      uold = afu
      ! Compute stencil, residual, & update
      do j = 1, irows - 2
        do i = 1, icols - 2
          ! Evaluate residual
          flres = (ax * (uold(i-1, j) + uold(i+1, j)) &
                 + ay * (uold(i, j-1) + uold(i, j+1)) &
                 + b * uold(i, j) - aff(i, j)) / b
          ! Update solution
          afu(i, j) = uold(i, j) - frelax * flres
          ! Accumulate residual error
          residual = residual + flres * flres
        end do
      end do
      iitercount = iitercount + 1
      residual = SQRT(residual) / REAL(icols * irows)
    end do
(linuxc7, Intel Core2Quad Q9400, PGI 10.2, Matrix 5000x5000)
~ 1460 MFlops single precision!
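For experimenting with the loop on a small grid, the excerpt can be wrapped into a self-contained host routine. This is a minimal sketch under stated assumptions: the module and subroutine names, the argument list, the 0-based array bounds (taken from the compiler feedback later in the slides), the start value of `residual`, and the use of double precision are all choices made here for illustration and are not shown in the original code.

```fortran
module jacobi_ref_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Serial Jacobi sweeps following the loop on the slide. Arrays are
  ! indexed from 0, so the interior is 1..icols-2 by 1..irows-2 and the
  ! boundary values of afu are never modified.
  subroutine jacobi_sweeps(icols, irows, afu, aff, ax, ay, b, frelax, &
                           iitermax, ftolerance, iitercount, residual)
    integer, intent(in) :: icols, irows, iitermax
    real(dp), intent(inout) :: afu(0:icols-1, 0:irows-1)
    real(dp), intent(in) :: aff(0:icols-1, 0:irows-1)
    real(dp), intent(in) :: ax, ay, b, frelax, ftolerance
    integer, intent(out) :: iitercount
    real(dp), intent(out) :: residual
    real(dp), allocatable :: uold(:, :)
    real(dp) :: flres
    integer :: i, j

    allocate(uold(0:icols-1, 0:irows-1))
    iitercount = 0
    residual = huge(1.0_dp)   ! assumed start value so the loop is entered

    do while (iitercount < iitermax .and. residual > ftolerance)
      residual = 0.0_dp
      ! Copy new solution into old
      uold = afu
      ! Compute stencil, residual, and update
      do j = 1, irows - 2
        do i = 1, icols - 2
          flres = (ax * (uold(i-1, j) + uold(i+1, j)) &
                 + ay * (uold(i, j-1) + uold(i, j+1)) &
                 + b * uold(i, j) - aff(i, j)) / b
          afu(i, j) = uold(i, j) - frelax * flres
          residual = residual + flres * flres
        end do
      end do
      iitercount = iitercount + 1
      residual = sqrt(residual) / real(icols * irows, dp)
    end do
  end subroutine jacobi_sweeps

end module jacobi_ref_mod
```

A quick sanity check is a problem whose exact solution is constant: with ax = ay = 1, b = -4.5, aff = -0.5 and boundary values of 1, the interior should converge to 1 as well.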

18 Jacobi & PGI Acc: 1st try
    do while (iitercount < iitermax .and. residual > ftolerance)
      residual = 0.0d0
      !$acc region
      ! Copy new solution into old
      uold = afu
      ! Compute stencil, residual, & update
      !$acc do
      do j = 1, irows - 2
        do i = 1, icols - 2
          ! Evaluate residual
          flres = (ax * (uold(i-1, j) + uold(i+1, j)) &
                 + ay * (uold(i, j-1) + uold(i, j+1)) &
                 + b * uold(i, j) - aff(i, j)) / b
          ! Update solution
          afu(i, j) = uold(i, j) - frelax * flres
          ! Accumulate residual error
          residual = residual + flres * flres
        end do
      end do
      !$acc end region
      iitercount = iitercount + 1
      residual = SQRT(residual) / REAL(icols * irows)
    end do
!$acc region: Compute Region Directive
!$acc do: Loop Mapping Directive
(linuxc7, Nvidia GeForce GT220, PGI 10.2, Matrix 5000x5000)
~ 1069 MFlops

19 Jacobi & PGI Acc: Compiler Feedback
    59, Loop not vectorized/parallelized: multiple exits
    68, Generating copyin(aff(1:icols-2,1:irows-2))
        Generating copyin(afu(0:icols-1,0:irows-1))
        Generating copyout(afu(1:icols-2,1:irows-2))
        Generating copyout(uold(0:icols-1,0:irows-1))
    69, Loop is parallelizable
        Accelerator kernel generated
        69, !$acc do parallel, vector(16)
    73, Loop is parallelizable
    74, Loop is parallelizable
        Accelerator kernel generated
        73, !$acc do parallel, vector(16)
        74, !$acc do parallel, vector(16)
            Cached references to size [18x18] block of 'uold'
        84, Sum reduction generated for residual

20 Jacobi & PGI Acc: 2nd try
    do while (iitercount < iitermax .and. residual > ftolerance)
      residual = 0.0d0
      !$acc region local(uold), copy(afu), copyin(aff)
      ! Copy new solution into old
      uold = afu
      ! Compute stencil, residual, & update
      !$acc do
      do j = 1, irows - 2
        do i = 1, icols - 2
          ! Evaluate residual
          flres = (ax * (uold(i-1, j) + uold(i+1, j)) &
                 + ay * (uold(i, j-1) + uold(i, j+1)) &
                 + b * uold(i, j) - aff(i, j)) / b
          ! Update solution
          afu(i, j) = uold(i, j) - frelax * flres
          ! Accumulate residual error
          residual = residual + flres * flres
        end do
      end do
      !$acc end region
      iitercount = iitercount + 1
      residual = SQRT(residual) / REAL(icols * irows)
    end do
local(uold), copy(afu), copyin(aff): Data Copy clauses (also: copyout, ...)
~ 1230 MFlops

21 Jacobi & PGI Acc: Compiler Feedback
    59, Loop not vectorized/parallelized: multiple exits
    68, Generating copyin(aff(:,:))
        Generating copy(afu(:,:))
        Generating local(uold(:,:))
    69, Loop is parallelizable
        Accelerator kernel generated
        69, !$acc do parallel, vector(16)
    73, Loop is parallelizable
    74, Loop is parallelizable
        Accelerator kernel generated
        73, !$acc do parallel, vector(16)
        74, !$acc do parallel, vector(16)
            Cached references to size [18x18] block of 'uold'
        84, Sum reduction generated for residual

22 Jacobi & PGI Acc: 3rd try
    do while (iitercount < iitermax .and. residual > ftolerance)
      residual = 0.0d0
      !$acc region local(uold), copy(afu), copyin(aff)
      ! Copy new solution into old
      uold = afu
      ! Compute stencil, residual, & update
      !$acc do parallel
      do j = 1, irows - 2
        !$acc do vector(256)
        do i = 1, icols - 2
          ! Evaluate residual
          flres = (ax * (uold(i-1, j) + uold(i+1, j)) &
                 + ay * (uold(i, j-1) + uold(i, j+1)) &
                 + b * uold(i, j) - aff(i, j)) / b
          ! Update solution
          afu(i, j) = uold(i, j) - frelax * flres
          ! Accumulate residual error
          residual = residual + flres * flres
        end do
      end do
      !$acc end region
      iitercount = iitercount + 1
      residual = SQRT(residual) / REAL(icols * irows)
    end do
!$acc do parallel: Loop Scheduling: parallel clause (doall parallelism)
!$acc do vector(256): Loop Scheduling: vector clause (synchronous parallelism)
~ 1570 MFlops

23 Jacobi & PGI Acc: Compiler Feedback
    59, Loop not vectorized/parallelized: multiple exits
    68, Generating copyin(aff(:,:))
        Generating copy(afu(:,:))
        Generating local(uold(:,:))
    69, Loop is parallelizable
        Accelerator kernel generated
        69, !$acc do parallel, vector(16)
    73, Loop is parallelizable
    75, Loop is parallelizable
        Accelerator kernel generated
        73, !$acc do parallel
        75, !$acc do vector(256)
            Cached references to size [258x3] block of 'uold'
        85, Sum reduction generated for residual

24 Jacobi & PGI Acc: CUDA Profiler

25 Jacobi & PGI Acc: 4th try
    !$acc data region local(uold), copy(afu), copyin(aff)
    do while (iitercount < iitermax .and. residual > ftolerance)
      residual = 0.0d0
      !$acc region
      ! Copy new solution into old
      uold = afu
      ! Compute stencil, residual, & update
      !$acc do parallel
      do j = 1, irows - 2
        !$acc do vector(256)
        do i = 1, icols - 2
          ! Evaluate residual
          flres = (ax * (uold(i-1, j) + uold(i+1, j)) &
                 + ay * (uold(i, j-1) + uold(i, j+1)) &
                 + b * uold(i, j) - aff(i, j)) / b
          ! Update solution
          afu(i, j) = uold(i, j) - frelax * flres
          ! Accumulate residual error
          residual = residual + flres * flres
        end do
      end do
      !$acc end region
      iitercount = iitercount + 1
      residual = SQRT(residual) / REAL(icols * irows)
    end do
    !$acc end data region
!$acc data region: Data Region Directive
~ 3550 MFlops
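A rough way to see why the data region helps is to count whole-array transfers between host and accelerator. The sketch below is only a counting model, not a measurement: the module and function names are hypothetical, scalars such as `residual` are ignored, and it assumes that clauses on a compute region take effect each time the region is executed while clauses on the enclosing data region take effect once.

```fortran
module transfer_model_mod
  implicit none
contains

  ! Whole-array host<->device transfers implied by the clauses
  ! local(uold), copy(afu), copyin(aff):
  !   copyin(aff) -> 1 transfer at entry
  !   copy(afu)   -> 1 at entry + 1 at exit
  !   local(uold) -> none
  pure function array_transfers(iterations, use_data_region) result(ntransfers)
    integer, intent(in) :: iterations
    logical, intent(in) :: use_data_region
    integer :: ntransfers
    integer, parameter :: per_region = 3

    if (use_data_region) then
      ! Clauses sit on the data region around the while loop.
      ntransfers = per_region
    else
      ! Clauses sit on the compute region inside the while loop.
      ntransfers = per_region * iterations
    end if
  end function array_transfers

end module transfer_model_mod
```

Under this model, 20 sweeps move 60 arrays with the clauses on the compute region but only 3 with the data region, which is in line with the jump from ~1570 to ~3550 MFlops reported on the slides.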

26 Jacobi & PGI Acc: Compiler Feedback
    59, Generating local(uold(:,:))
        Generating copyin(aff(:,:))
        Generating copy(afu(:,:))
    60, Loop not vectorized/parallelized: multiple exits
    70, Loop is parallelizable
        Accelerator kernel generated
        70, !$acc do parallel, vector(16)
    74, Loop is parallelizable
    76, Loop is parallelizable
        Accelerator kernel generated
        74, !$acc do parallel
        76, !$acc do vector(256)
            Cached references to size [258x3] block of 'uold'
        86, Sum reduction generated for residual

27 Jacobi & PGI Acc: CUDA Profiler

28 PGI 10.1 vs 10.2/10.3
    [...]
    !$acc region
    ! Copy new solution into old
    uold = afu
    [...]
10.1:
    66, Memory copy idiom, array assignment replaced by call to pgf90_mcopy4
    => Copy operation performed on host CPU
    ~ 4180 MFlops
10.2 / 10.3:
    66, Loop is parallelizable
        Accelerator kernel generated
        66, !$acc do parallel, vector(16)
    => Copy operation performed on GPU
    ~ 3550 MFlops
Performance on Tesla PGI: 9651 Mflops (10.1) vs (10.2)

29 Best Version Yet
Replace
    [...]
    !$acc region
    ! Copy new solution into old
    uold = afu
    [...]
by
    [...]
    !$acc region
    ! Copy new solution into old
    !$acc do parallel
    do j = 0, irows - 1
      !$acc do vector(256)
      do i = 0, icols - 1
        uold(i,j) = afu(i,j)
      end do
    end do
    [...]
10.1: ~ 6070 MFlops
10.2 / 10.3: ~ 3800 MFlops

30 Summary for Jacobi Example
Data movement between Host and Accelerator
    Use data copy clauses (local, copy, ...)
    Copy whole arrays
    Use a data region
Parallelism on accelerator
    Try different loop scheduling (e.g. doall and synchronous parallelism: parallel, vector)
    Try to make width in vector(width) directives a multiple of 32 to match Nvidia CUDA warp size
PGI Links
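The "multiple of 32" advice amounts to rounding a candidate width up to the next full warp. A minimal sketch of that arithmetic follows; the module and function names are hypothetical, and only the warp size of 32 comes from the slide.

```fortran
module warp_width_mod
  implicit none
  integer, parameter :: warp_size = 32   ! Nvidia CUDA warp size, as stated on the slide
contains

  ! Smallest multiple of warp_size that is >= width (for width >= 1).
  pure function round_up_to_warp(width) result(rounded)
    integer, intent(in) :: width
    integer :: rounded
    rounded = ((width + warp_size - 1) / warp_size) * warp_size
  end function round_up_to_warp

end module warp_width_mod
```

For example, a candidate width of 250 would be rounded to 256, the value used in the vector(256) clauses of the later Jacobi versions.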

31 Performance Considerations
Data movement between Host and Accelerator
    Minimize amount, number and frequency
    Maximize bandwidth
    Optimize data allocation in device memory
Parallelism on accelerator
    Lots of MIMD parallelism to fill multiprocessors
    Lots of SIMD parallelism to fill cores on a multiprocessor
    Lots more MIMD parallelism
Data movement between device memory and cores
    Minimize frequency
    Optimize strides: stride-1 in vector dimension
    Optimize alignment: 16-word aligned in vector dimension
    Store array blocks in data cache (CUDA shared memory)
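The stride-1 recommendation follows from Fortran's column-major storage: consecutive values of the first index are adjacent in memory. The sketch below makes the two strides explicit for an array declared as a(0:icols-1, 0:irows-1); the module and function names are hypothetical.

```fortran
module stride_mod
  implicit none
contains

  ! Linear offset (in elements) of a(i, j) for an array declared as
  ! a(0:icols-1, 0:irows-1); Fortran stores arrays column-major.
  pure function element_offset(i, j, icols) result(offset)
    integer, intent(in) :: i, j, icols
    integer :: offset
    offset = i + j * icols
  end function element_offset

end module stride_mod
```

Stepping i by one moves one element, while stepping j by one moves icols elements (5000 for the matrix used in the slides). This is why the Jacobi versions put the vector clause on the i loop and the parallel clause on the j loop.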

32 BACKUP SLIDES

33 PGI Accelerator: Compiler Flags
    pgfortran -ta=nvidia,time -Minfo=accel a.f90
Links in a timer library: collects and prints out simple timing information about the accelerator regions and generated kernels
    jacobi
      59: region entered 1 time
          time(us): total= init= region= kernels= data=
          w/o init: total= max= min= avg=
      : kernel launched 20 times
          grid: [313x313] block: [16x16]
          time(us): total= max=16391 min=15497 avg=
      : kernel launched 20 times
          grid: [4998] block: [256]
          time(us): total= max=67314 min=66964 avg=
      : kernel launched 20 times
          grid: [1] block: [256]
          time(us): total=587 max=36 min=27 avg=29
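The grid sizes in the timing output can be related to the loop extents by ceiling division. The following is a small sketch of that arithmetic; the module and function names are hypothetical, and the mapping of each reported kernel to a particular loop is an interpretation of the output, not something the slides state.

```fortran
module launch_geometry_mod
  implicit none
contains

  ! Number of thread blocks needed to cover n iterations when each
  ! block covers blocksize of them (ceiling division).
  pure function blocks_needed(n, blocksize) result(nblocks)
    integer, intent(in) :: n, blocksize
    integer :: nblocks
    nblocks = (n + blocksize - 1) / blocksize
  end function blocks_needed

end module launch_geometry_mod
```

With 5000 iterations per dimension and vector(16), this gives 313 blocks per dimension, consistent with the reported grid of [313x313] and block of [16x16]. The grid of [4998] with block [256] is consistent with one block per iteration of the j loop (irows - 2 = 4998) and 256 threads along i.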

34 PGI Accelerator: Runtime library routines
Fortran: module accel_lib
C: accel.h
acc_get_num_devices (devicetype)
acc_set_device (devicetype)
acc_init (devicetype)
    Initialise runtime for device, e.g. for isolating initialisation cost from computational cost
acc_shutdown (devicetype)

35 PGI Accelerator: Compiler Flags
    pgfortran -ta=nvidia,cc11 -Minfo=accel a.f90
Generates code for compute capability 1.1
Compute capability depends on graphics card
    pgfortran -ta=nvidia,keepgpu -Minfo=accel a.f90
Keeps the kernel source files

36 More Information
PGI User's Guide
CUDA Fortran:
PGI Accelerator:
PGI User Forum:

37 THANK YOU FOR YOUR ATTENTION!


More information

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca March 13, 2014 Outline 1 Heterogeneous Computing 2 GPGPU - Overview Hardware Software

More information

An Introduction to OpenACC - Part 1

An Introduction to OpenACC - Part 1 An Introduction to OpenACC - Part 1 Feng Chen HPC User Services LSU HPC & LONI sys-help@loni.org LONI Parallel Programming Workshop Louisiana State University Baton Rouge June 01-03, 2015 Outline of today

More information

Use of Accelerate Tools PGI CUDA FORTRAN Jacket

Use of Accelerate Tools PGI CUDA FORTRAN Jacket Use of Accelerate Tools PGI CUDA FORTRAN Jacket Supercomputing Institute For Advanced Computational Research e-mail: szhang@msi.umn.edu or help@msi.umn.edu Tel: 612-624-8858 (direct), 612-626-0802(help)

More information

The PGI Accelerator Programming Model on NVIDIA GPUs Part 1

The PGI Accelerator Programming Model on NVIDIA GPUs Part 1 Technical News from The Portland Group PGI Home Page June 2009 The PGI Accelerator Programming Model on NVIDIA GPUs Part 1 by Michael Wolfe, PGI Compiler Engineer GPUs have a very high compute capacity,

More information

An OpenACC construct is an OpenACC directive and, if applicable, the immediately following statement, loop or structured block.

An OpenACC construct is an OpenACC directive and, if applicable, the immediately following statement, loop or structured block. API 2.6 R EF ER ENC E G U I D E The OpenACC API 2.6 The OpenACC Application Program Interface describes a collection of compiler directives to specify loops and regions of code in standard C, C++ and Fortran

More information

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction Francesco Rossi University of Bologna and INFN * Using this terminology since you ve already heard of SIMD and SPMD at this school

More information

PGI Visual Fortran Release Notes. Version The Portland Group

PGI Visual Fortran Release Notes. Version The Portland Group PGI Visual Fortran Release Notes Version 13.3 The Portland Group While every precaution has been taken in the preparation of this document, The Portland Group (PGI ), a wholly-owned subsidiary of STMicroelectronics,

More information

PGI Visual Fortran. Release Notes The Portland Group STMicroelectronics Two Centerpointe Drive Lake Oswego, OR 97035

PGI Visual Fortran. Release Notes The Portland Group STMicroelectronics Two Centerpointe Drive Lake Oswego, OR 97035 PGI Visual Fortran Release Notes 2010 The Portland Group STMicroelectronics Two Centerpointe Drive Lake Oswego, OR 97035 While every precaution has been taken in the preparation of this document, The Portland

More information

KernelGen a toolchain for automatic GPU-centric applications porting. Nicolas Lihogrud Dmitry Mikushin Andrew Adinets

KernelGen a toolchain for automatic GPU-centric applications porting. Nicolas Lihogrud Dmitry Mikushin Andrew Adinets P A R A L L E L C O M P U T A T I O N A L T E C H N O L O G I E S ' 2 0 1 2 KernelGen a toolchain for automatic GPU-centric applications porting Nicolas Lihogrud Dmitry Mikushin Andrew Adinets Contents

More information

Early Experiences with the OpenMP Accelerator Model

Early Experiences with the OpenMP Accelerator Model Early Experiences with the OpenMP Accelerator Model Canberra, Australia, IWOMP 2013, Sep. 17th * University of Houston LLNL-PRES- 642558 This work was performed under the auspices of the U.S. Department

More information

Hardware/Software Co-Design

Hardware/Software Co-Design 1 / 13 Hardware/Software Co-Design Review so far Miaoqing Huang University of Arkansas Fall 2011 2 / 13 Problem I A student mentioned that he was able to multiply two 1,024 1,024 matrices using a tiled

More information

High Performance Computing and GPU Programming

High Performance Computing and GPU Programming High Performance Computing and GPU Programming Lecture 1: Introduction Objectives C++/CPU Review GPU Intro Programming Model Objectives Objectives Before we begin a little motivation Intel Xeon 2.67GHz

More information

PVF Release Notes. Version PGI Compilers and Tools

PVF Release Notes. Version PGI Compilers and Tools PVF Release Notes Version 2015 PGI Compilers and Tools TABLE OF CONTENTS Chapter 1. PVF Release Overview...1 1.1. Product Overview... 1 1.2. Microsoft Build Tools... 2 1.3. Terms and Definitions...2 Chapter

More information

OpenACC programming for GPGPUs: Rotor wake simulation

OpenACC programming for GPGPUs: Rotor wake simulation DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing

More information

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum GPU & High Performance Computing (by NVIDIA) CUDA Compute Unified Device Architecture 29.02.2008 Florian Schornbaum GPU Computing Performance In the last few years the GPU has evolved into an absolute

More information

PGI Visual Fortran Release Notes

PGI Visual Fortran Release Notes PGI Visual Fortran Release Notes Version 14.4 PGI Compilers and Tools TABLE OF CONTENTS Chapter 1. PVF Release Overview...1 1.1. Product Overview... 1 1.2. Microsoft Build Tools... 2 1.3. Terms and Definitions...2

More information

Technische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics

Technische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics GPU Programming Rüdiger Westermann Chair for Computer Graphics & Visualization Faculty of Informatics Overview Programming interfaces and support libraries The CUDA programming abstraction An in-depth

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

OpenACC compiling and performance tips. May 3, 2013

OpenACC compiling and performance tips. May 3, 2013 OpenACC compiling and performance tips May 3, 2013 OpenACC compiler support Cray Module load PrgEnv-cray craype-accel-nvidia35 Fortran -h acc, noomp # openmp is enabled by default, be careful mixing -fpic

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna14/ [ 10 ] GPU and CUDA Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture and Performance

More information

CUDA Fortran. Programming Guide and Reference. Release The Portland Group

CUDA Fortran. Programming Guide and Reference. Release The Portland Group CUDA Fortran Programming Guide and Reference Release 2010 The Portland Group While every precaution has been taken in the preparation of this document, The Portland Group (PGI ), a wholly-owned subsidiary

More information

PVF Release Notes. Version PGI Compilers and Tools

PVF Release Notes. Version PGI Compilers and Tools PVF Release Notes Version 2015 PGI Compilers and Tools TABLE OF CONTENTS Chapter 1. PVF Release Overview...1 1.1. Product Overview... 1 1.2. Microsoft Build Tools... 2 1.3. Terms and Definitions...2 Chapter

More information

Using a GPU in InSAR processing to improve performance

Using a GPU in InSAR processing to improve performance Using a GPU in InSAR processing to improve performance Rob Mellors, ALOS PI 152 San Diego State University David Sandwell University of California, San Diego What is a GPU? (Graphic Processor Unit) A graphics

More information

OpenACC 2.5 and Beyond. Michael Wolfe PGI compiler engineer

OpenACC 2.5 and Beyond. Michael Wolfe PGI compiler engineer OpenACC 2.5 and Beyond Michael Wolfe PGI compiler engineer michael.wolfe@pgroup.com OpenACC Timeline 2008 PGI Accelerator Model (targeting NVIDIA GPUs) 2011 OpenACC 1.0 (targeting NVIDIA GPUs, AMD GPUs)

More information

GPU Programming. Ringberg Theorie Seminar 2010

GPU Programming. Ringberg Theorie Seminar 2010 or How to tremendously accelerate your code? Michael Kraus, Christian Konz Max-Planck-Institut für Plasmaphysik, Garching Ringberg Theorie Seminar 2010 Introduction? GPU? GPUs can do more than just render

More information

GPU Computing with OpenACC Directives Dr. Timo Stich Developer Technology Group NVIDIA Corporation

GPU Computing with OpenACC Directives Dr. Timo Stich Developer Technology Group NVIDIA Corporation GPU Computing with OpenACC Directives Dr. Timo Stich Developer Technology Group NVIDIA Corporation WHAT IS GPU COMPUTING? Add GPUs: Accelerate Science Applications CPU GPU Small Changes, Big Speed-up Application

More information

OPENACC DIRECTIVES FOR ACCELERATORS NVIDIA

OPENACC DIRECTIVES FOR ACCELERATORS NVIDIA OPENACC DIRECTIVES FOR ACCELERATORS NVIDIA Directives for Accelerators ABOUT OPENACC GPUs Reaching Broader Set of Developers 1,000,000 s 100,000 s Early Adopters Research Universities Supercomputing Centers

More information

Introduction to OpenACC. Peng Wang HPC Developer Technology, NVIDIA

Introduction to OpenACC. Peng Wang HPC Developer Technology, NVIDIA Introduction to OpenACC Peng Wang HPC Developer Technology, NVIDIA penwang@nvidia.com Outline Introduction of directive-based parallel programming Basic parallel construct Data management Controlling parallelism

More information

PVF RELEASE NOTES. Version 2017

PVF RELEASE NOTES. Version 2017 PVF RELEASE NOTES Version 2017 TABLE OF CONTENTS Chapter 1. PVF Release Overview...1 1.1. Product Overview... 1 1.2. Microsoft Build Tools... 2 1.3. Terms and Definitions... 2 Chapter 2. New and Modified

More information

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

More information

GPU programming CUDA C. GPU programming,ii. COMP528 Multi-Core Programming. Different ways:

GPU programming CUDA C. GPU programming,ii. COMP528 Multi-Core Programming. Different ways: COMP528 Multi-Core Programming GPU programming,ii www.csc.liv.ac.uk/~alexei/comp528 Alexei Lisitsa Dept of computer science University of Liverpool a.lisitsa@.liverpool.ac.uk Different ways: GPU programming

More information

Comparing OpenACC 2.5 and OpenMP 4.1 James C Beyer PhD, Sept 29 th 2015

Comparing OpenACC 2.5 and OpenMP 4.1 James C Beyer PhD, Sept 29 th 2015 Comparing OpenACC 2.5 and OpenMP 4.1 James C Beyer PhD, Sept 29 th 2015 Abstract As both an OpenMP and OpenACC insider I will present my opinion of the current status of these two directive sets for programming

More information

CS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra)

CS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra) CS 470 Spring 2016 Mike Lam, Professor Other Architectures (with an aside on linear algebra) Parallel Systems Shared memory (uniform global address space) Primary story: make faster computers Programming

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

CUDA 5 Features in PGI CUDA Fortran 2013

CUDA 5 Features in PGI CUDA Fortran 2013 第 1 頁, 共 7 頁 Technical News from The Portland Group PGI Home Page March 2013 CUDA 5 Features in PGI CUDA Fortran 2013 by Brent Leback PGI Engineering Manager The 2013 release of PGI CUDA Fortran introduces

More information

Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009

Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009 Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009 Introduction CUDA is a tool to turn your graphics card into a small computing cluster. It s not always

More information

Don t reinvent the wheel. BLAS LAPACK Intel Math Kernel Library

Don t reinvent the wheel. BLAS LAPACK Intel Math Kernel Library Libraries Don t reinvent the wheel. Specialized math libraries are likely faster. BLAS: Basic Linear Algebra Subprograms LAPACK: Linear Algebra Package (uses BLAS) http://www.netlib.org/lapack/ to download

More information

Directive-based Programming for Highly-scalable Nodes

Directive-based Programming for Highly-scalable Nodes Directive-based Programming for Highly-scalable Nodes Doug Miles Michael Wolfe PGI Compilers & Tools NVIDIA Cray User Group Meeting May 2016 Talk Outline Increasingly Parallel Nodes Exposing Parallelism

More information

GPU Programming with Ateji PX June 8 th Ateji All rights reserved.

GPU Programming with Ateji PX June 8 th Ateji All rights reserved. GPU Programming with Ateji PX June 8 th 2010 Ateji All rights reserved. Goals Write once, run everywhere, even on a GPU Target heterogeneous architectures from Java GPU accelerators OpenCL standard Get

More information

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture Rafia Inam Mälardalen Real-Time Research Centre Mälardalen University, Västerås, Sweden http://www.mrtc.mdh.se rafia.inam@mdh.se CONTENTS

More information

RapidMind & PGI Accelerator Compiler. Dr. Volker Weinberg Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften

RapidMind & PGI Accelerator Compiler. Dr. Volker Weinberg Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften RapidMind & PGI Accelerator Compiler Dr. Volker Weinberg Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften volker.weinberg@lrz.de PRACE Workshop New Languages & Future Technology Prototypes

More information

Multi-Processors and GPU

Multi-Processors and GPU Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock

More information