CUDA 5 Features in PGI CUDA Fortran 2013

Technical News from The Portland Group, March 2013
by Brent Leback, PGI Engineering Manager

The 2013 release of PGI CUDA Fortran introduces support for many of the new CUDA 5.0 features. This article briefly describes how you can take advantage of these features. Over the course of the paper, we'll show several examples of matrix multiplication using double precision data types.

Separate Compilation of Device Code in CUDA

An excellent set of software packages I've used over the years are the multiple-precision libraries from David Bailey, et al. In CUDA 5.0, it is now easy to incorporate community Fortran libraries like these into your CUDA Fortran programs. Here is a simple matrix multiply example that we've shown many times in the past. The global subroutine in CUDA Fortran looks like this:

   attributes(global) subroutine dgemm16(a, b, c, m, n, k)
      integer, value :: m, n, k
      real(8) :: a(m,*), b(k,*), c(m,*)
      real(8), shared, dimension(17,16) :: bs
      real(8), device :: cloc(16), ax
      inx = threadidx%x
      iny = threadidx%y
      ibx = (blockidx%x-1) * 256
      iby = (blockidx%y-1) * 16
      ia = ibx + (iny-1)*16 + inx
      ib = inx
      ic = ia
      jb = iby + iny
      jc = iby + 1
      do i = 1, 16
         cloc(i) = 0.0d0
      end do
      do ik = 1, k, 16
         bs(iny,inx) = b(ib,jb)
         call syncthreads()
         do j = 1, 16
            ax = a(ia,ik+j-1)
            do i = 1, 16
               cloc(i) = cloc(i) + ax * bs(i,j)
            end do
         end do
         ib = ib + 16
         call syncthreads()
      end do
      do i = 1, 16
         c(ic,jc+i-1) = cloc(i)
      end do
      call syncthreads()
   end subroutine

This kernel uses a small (padded) square shared memory array to cache the values of B among the threads in the thread block, and each thread reads a value of A from global memory and applies it 16 times in the inner loop. It is a simple implementation and performs reasonably well.
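For context, here is what a minimal host-side driver for this kernel might look like. This is my sketch, not the article's actual driver (dgemmpaper.cuf is not shown here), and the module name is assumed; note the grid shape, which matches the devblocks setting used later in this article — each 16x16 thread block computes a 256-row by 16-column tile of C:

   program test_dgemm16
      use cudafor
      use simple_dgemm                 ! assumed module containing dgemm16
      integer, parameter :: n = 512
      real(8), device, allocatable :: a_d(:,:), b_d(:,:), c_d(:,:)
      type(dim3) :: blocks, threads
      integer :: istat

      allocate(a_d(n,n), b_d(n,n), c_d(n,n))
      ! ... fill a_d and b_d from host arrays here ...

      blocks  = dim3(n/256, n/16, 1)   ! one block per 256x16 tile of C
      threads = dim3(16, 16, 1)
      call dgemm16<<<blocks,threads>>>(a_d, b_d, c_d, n, n, n)
      istat = cudaDeviceSynchronize()
   end program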

For this article, I've stacked the deck against this kernel and used poorly conditioned data: randomly ordered sets of positive and negative values with magnitudes between 2**(-127) and 2**128 (roughly 1.0E±38). With exact arithmetic, the numbers should cancel out and give a result of 0.0d0 for each element of C. Instead, for a set of 512x512 matrices, we see that about half the values are nonzero:

   % ./a.out
    ... errors were encountered
    Max error was ...E+22
    Ave error was ...
    512x512 * 512x512:  ... ms   ... GFlops/s
    ### C(1,1) = ...E-039

Compared to the largest numbers in the dataset, the results are good to 16 digits, but they are still quite a ways from the 0.0 we want.

Now, let's get back to separate compilation. I put together a very quick port of David Bailey's ddfun90 library to CUDA Fortran. The ddfun90 library performs double-double arithmetic using two real*8 values held in a derived type. The library uses generic interfaces and, once built, can be enabled very quickly in your F90 code. There are certain constructs in the library (notably error handling using Fortran I/O) that I've left for another day. For now, I'm concentrating on double-double add, subtract, multiply and divide. For these routines, I made changes to the ddfun90 library source for the device entry points, adding attributes(device) to the functions and subroutines I want to call from device code:

   attributes(device) subroutine ddadd (dda, ddb, ddc)

   !  This subroutine computes ddc = dda + ddb.

      implicit none
      real*8 dda(2), ddb(2), ddc(2)
      real*8 e, t1, t2

   !  Compute dda + ddb using Knuth's trick.
      t1 = dda(1) + ddb(1)
      e = t1 - dda(1)
      t2 = ((ddb(1) - e) + (dda(1) - (t1 - e))) + dda(2) + ddb(2)

   !  The result is t1 + t2, after normalization.
      ddc(1) = t1 + t2
      ddc(2) = t2 - (ddc(1) - t1)
      return
   end subroutine
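Knuth's trick is an error-free transformation: t1 plus the parenthesized expression recovers dda(1) + ddb(1) exactly, so t2 carries the rounding error of the leading-order sum along with the low-order words. For example, adding 1.0 and 2.0**(-60) in plain double precision simply loses the small addend, while ddadd preserves it in the low-order word. A throwaway kernel to check this — my own sketch, assuming it is linked against the device-enabled ddfun90 objects with -Mcuda=rdc — could look like:

   attributes(global) subroutine testadd(res)
      real(8) :: res(2)
      real(8) :: x(2), y(2)
      x(1) = 1.0d0        ;  x(2) = 0.0d0
      y(1) = 2.0d0**(-60) ;  y(2) = 0.0d0
      call ddadd(x, y, res)
      ! res(1) = 1.0d0 and res(2) = 2.0d0**(-60):
      ! the bit that a plain double precision add would have dropped
   end subroutine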

Similarly, in ddmod90, I added the device attribute on the overloaded procedures I wanted to use:

   interface operator (+)
      module procedure dd_addqq
   end interface

   attributes(device) function dd_addqq (qa, qb)
      implicit real*8 (d), &
         type (dd_real) (q), complex (kdb) (x), type (dd_complex) (z)
      type (dd_real) :: dd_addqq
      intent (in) :: qa, qb
      call ddadd (qa%ddr, qb%ddr, dd_addqq%ddr)
      return
   end function
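For reference, the dd_real type these wrappers operate on — accessed through the qa%ddr component above — is essentially just a pair of doubles. In Bailey's ddmod90 it looks like this (a paraphrased sketch; ddr(1) holds the leading double and ddr(2) the correction term):

   type dd_real
      sequence
      real*8 :: ddr(2)
   end type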

To take advantage of separate compilation with CUDA Fortran, use the rdc flag. This flag, carried over from nvcc, stands for "relocatable device code".

   % pgf90 -c -O2 -Mcuda=rdc ddfun90.cuf ddmod90.cuf

The generated .o files can be used like any other object files, or even put into a library:

   % ar rc ddfunc.a ddfun90.o ddmod90.o

Now I need to modify my original program to use the extended-precision arithmetic. Thanks to the overloaded operators, this is pretty easy. First, use ddmodule:

   MODULE simple_dgemm
      use ddmodule   ! <- defines entry points, data types, and global data
                     !    for use in contained subprograms
      CONTAINS

And then change the type of the shared and local variables:

   type(dd_real), shared, dimension(17,16) :: bs
   type(dd_real), device :: cloc(16), ax

That's it. Arithmetic previously done in double precision, such as:

   cloc(i) = cloc(i) + ax * bs(i,j)

is now done in double-double precision due to the type changes. You can build the driver program and link in either the .o files or the .a archive as usual. Just remember to add the rdc option to Mcuda:

   % pgf90 -O2 -Mcuda=rdc dgemmpaper.cuf ddfun90.o ddmod90.o

or

   % pgf90 -O2 -Mcuda=rdc dgemmpaper.cuf ddfunc.a

As you might suspect, when we run it now the performance is much slower:

   % ./a.out
    Test passed!
    512x512 * 512x512:  ... ms   ... GFlops/s
    ### C(1,1) = ...

But look: instead of generating bad answers quickly, we are reliably generating good answers, and in the process we've learned just how easy it is to take advantage of third-party libraries with CUDA 5.0. Of course, the input data could be changed so that even the double-double arithmetic produces incorrect results. Bailey's page has a link to another package, mpfun90, that could probably address most of those issues too. The interfaces in that library are similar in nature to ddfun90's.

The rdc flag by itself implies CUDA 5.0. In other words, Mcuda=rdc is equivalent to Mcuda=cuda5.0,rdc. Using rdc is not supported in CUDA 4.2. Another CUDA constraint is that separate compilation only works for compute capability 2.0 (Fermi) and above. If you're interested, you can download this entire code (minus ddfun90) from the PGI website.

Dynamic Parallelism of Device Code in CUDA

PGI 2013 also adds support for dynamic parallelism in CUDA Fortran. This means a GPU kernel can launch one or more sub-kernels, and it has access to a limited set of the CUDA API by which to control them. The original 512x512 matrix multiply above can be broken up into eight sub-matrix multiplies:

   C11 C12     A11 A12     B11 B12
   C21 C22  =  A21 A22  *  B21 B22

   C11 = A11*B11 + A12*B21
   C12 = A11*B12 + A12*B22
   C21 = A21*B11 + A22*B21
   C22 = A21*B12 + A22*B22

The code for this decomposition can now all be contained in a CUDA Fortran global subroutine, as long as you are using CUDA 5.0 on a compute capability 3.5 (Kepler K20) card. Note that dgemm16 here is a variant of the earlier kernel that takes explicit leading-dimension arguments, so it can operate on sub-matrices:

   real(8), device, allocatable :: m1(:,:), m2(:,:), m3(:,:), m4(:,:)
   real(8), device, allocatable :: m5(:,:), m6(:,:), m7(:,:), m8(:,:)
   type(dim3), device :: devthreads, devblocks

   newn = n / 2   ! For convenience, now assume square matrices

   allocate(m1(1:newn,1:newn))
   allocate(m2(1:newn,1:newn))
   allocate(m3(1:newn,1:newn))
   allocate(m4(1:newn,1:newn))
   allocate(m5(1:newn,1:newn))
   allocate(m6(1:newn,1:newn))
   allocate(m7(1:newn,1:newn))
   allocate(m8(1:newn,1:newn))

   devblocks  = dim3(newn/256, newn/16, 1)
   devthreads = dim3(16, 16, 1)

   call dgemm16<<<devblocks,devthreads>>>(a(1,1), m,         &
                                          b(1,1), k,         &
                                          m1(1,1), newn, newn, newn, newn)
   call dgemm16<<<devblocks,devthreads>>>(a(1,1+k/2), m,     &
                                          b(1+k/2,1), k,     &
                                          m2(1,1), newn, newn, newn, newn)
   call dgemm16<<<devblocks,devthreads>>>(a(1,1), m,         &
                                          b(1,1+n/2), k,     &
                                          m3(1,1), newn, newn, newn, newn)
   call dgemm16<<<devblocks,devthreads>>>(a(1,1+k/2), m,     &
                                          b(1+k/2,1+n/2), k, &
                                          m4(1,1), newn, newn, newn, newn)
   call dgemm16<<<devblocks,devthreads>>>(a(1+m/2,1), m,     &
                                          b(1,1), k,         &
                                          m5(1,1), newn, newn, newn, newn)
   call dgemm16<<<devblocks,devthreads>>>(a(1+m/2,1+k/2), m, &
                                          b(1+k/2,1), k,     &
                                          m6(1,1), newn, newn, newn, newn)
   call dgemm16<<<devblocks,devthreads>>>(a(1+m/2,1), m,     &
                                          b(1,1+n/2), k,     &
                                          m7(1,1), newn, newn, newn, newn)
   call dgemm16<<<devblocks,devthreads>>>(a(1+m/2,1+k/2), m, &
                                          b(1+k/2,1+n/2), k, &
                                          m8(1,1), newn, newn, newn, newn)

   istat = cudadevicesynchronize()

   call add16<<<1,devthreads>>>(m1, newn, m2, newn, c(1,1),         m, newn)
   call add16<<<1,devthreads>>>(m3, newn, m4, newn, c(1,1+n/2),     m, newn)
   call add16<<<1,devthreads>>>(m5, newn, m6, newn, c(1+m/2,1),     m, newn)
   call add16<<<1,devthreads>>>(m7, newn, m8, newn, c(1+m/2,1+n/2), m, newn)

   istat = cudadevicesynchronize()
   deallocate(m1,m2,m3,m4,m5,m6,m7,m8)
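The add16 kernel used above is not shown in the article. A minimal sketch consistent with its call sites — arguments (a, lda, b, ldb, c, ldc, n), launched with a single 16x16 thread block that strides over the newn x newn quadrant — might be:

   attributes(global) subroutine add16(a, lda, b, ldb, c, ldc, n)
      integer, value :: lda, ldb, ldc, n
      real(8) :: a(lda,*), b(ldb,*), c(ldc,*)
      integer :: i, j
      ! a single 16x16 block covers the whole quadrant by striding
      do j = threadidx%y, n, blockdim%y
         do i = threadidx%x, n, blockdim%x
            c(i,j) = a(i,j) + b(i,j)
         end do
      end do
   end subroutine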

Just as in CUDA Fortran host code, the F90 allocate statement in device code wraps the cudamalloc() calls, and F90 deallocate wraps cudafree(). The entire set of CUDA device API calls is available in CUDA Fortran. This code is compiled using these flags:

   % pgf90 -Mcuda=cuda5.0,cc35,rdc dgemmdynamic.cuf

This code doesn't perform very well, because all the launches are serialized along the same stream. CUDA 5.0 adds the ability to create streams from device code. Use it like this:

   integer :: flags
   integer(kind=cuda_stream_kind) :: istreams(8)

   flags = cudastreamnonblocking
   do i = 1, 8
      istat = cudastreamcreatewithflags(istreams(i), flags)
   end do

   call dgemm16<<<devblocks,devthreads,0,istreams(1)>>>(a(1,1), m, &
                                                        b(1,1), k, &
                                                        m1(1,1), newn, newn, newn, newn)

and similarly for the remaining seven multiplies, each launched on its own stream. Now the performance you'll see from this version is comparable to the original. Note that you really shouldn't expect a speedup from launching routines on the device relative to launching them from the host. Dynamic parallelism is intended to ease programming; it's not a performance optimization in itself.

That said, there is still room for improvement here. As a further optimization, you can allocate the new matrices and create the streams just once. Typically, in CUDA Fortran you move the declarations out of the global subroutine to the module level to ensure their state is saved across multiple kernel invocations.
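A sketch of that refactoring — module-level device data plus a one-time setup kernel; the names and organization are my own, not taken from the article's tar file:

   module dynamic_work
      use cudafor
      real(8), device, allocatable :: m1(:,:), m2(:,:)   ! ... likewise m3-m8
      integer(kind=cuda_stream_kind), device :: istreams(8)
   contains
      ! launch setup<<<1,1>>> once; the allocations and streams then
      ! persist across later kernel invocations that use this module
      attributes(global) subroutine setup(newn)
         integer, value :: newn
         integer :: i, istat
         allocate(m1(newn,newn), m2(newn,newn))          ! ... and the rest
         do i = 1, 8
            istat = cudastreamcreatewithflags(istreams(i), cudastreamnonblocking)
         end do
      end subroutine
   end module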

I've also put together one level of the Strassen matrix multiply algorithm using basically the same techniques as those shown above. The Strassen algorithm uses one fewer sub-matrix multiply at the cost of more additions. All three versions of the code are included in my example tar file.

Dynamic Parallelism through CUBLAS

One easy way to take advantage of dynamic parallelism is through NVIDIA's newly available device libraries. The first library available from NVIDIA is the cublas library, and PGI provides device-code wrappers for it just as we have done in the past for the host-called cublas libraries. To create a dgemm routine equivalent to the previous examples, you could code a single-threaded global subroutine like this:

   attributes(global) subroutine dgemm16(a, b, c, m, n, k)
      use cublas_device
      integer, value :: m, n, k
      double precision, device :: a(m,*), b(k,*), c(m,*)
      double precision, device :: alpha, beta
      type(cublashandle) :: ch1
      integer transa, transb
      i = threadidx%x
      if (i.eq.1) then
         istat = cublascreate_v2(ch1)
         alpha = 1.0d0
         beta  = 0.0d0
         transa = cublas_op_n
         transb = cublas_op_n
         istat = cublasdgemm_v2(ch1, transa, transb, m, n, k, alpha, &
                                a, m, b, k, beta, c, m)
         istat = cublasdestroy_v2(ch1)
      end if
      return
   end subroutine

Targeting New CUDA Runtime and Hardware with CUBLAS and PGI Compilers

Make sure you compile correctly for your new target architectures. For now, the default CUDA version with PGI 13.x is CUDA 4.2:

   bash-4.1$ pgf90 -O2 dgemmhostcublas.cuf -lcublas
   bash-4.1$ ./a.out
    ... errors were encountered
    Max error was ...E+22
    Ave error was ...
    512x512 * 512x512:  ... ms   ... GFlops/s
    ### C(1,1) = ...E-039

You can always use the latest supported CUDA version by specifying it on the command line explicitly:

   bash-4.1$ pgf90 -O2 -Mcuda=cuda5.0,cc35 dgemmhostcublas.cuf -lcublas
   bash-4.1$ ./a.out
    ... errors were encountered
    Max error was ...E+22
    Ave error was ...
    512x512 * 512x512:  ... ms   ... GFlops/s
    ### C(1,1) = ...E-039

I'm certain you can top 582 GFlops/sec by making the problem bigger, but I'll leave that as an exercise for the reader. I probably didn't compute that many flops total in the first 10 years of my career, so I'm not that greedy.

Final Remarks

There are potentially many more cool opportunities to take advantage of with CUDA Fortran and CUDA 5.0, more than I can show here. PGI will be showing these demos and more at GTC this month. Come by and chat.
