CUDA 5 Features in PGI CUDA Fortran 2013
Technical News from The Portland Group, March 2013

by Brent Leback, PGI Engineering Manager

The 2013 release of PGI CUDA Fortran introduces support for many of the new CUDA 5.0 features. This article briefly describes how you can take advantage of these features. Over the course of the paper, we'll show several examples of matrix multiplication using double precision data types.

Separate Compilation of Device Code in CUDA

An excellent set of software packages I've used over the years are the multiple precision libraries from David Bailey, et al. In CUDA 5.0, it is now easy to incorporate community Fortran libraries like these into your CUDA Fortran programs.

Here is a simple matrix multiply example that we've shown many times in the past. The global subroutine in CUDA Fortran looks like this:

    attributes(global) subroutine dgemm16(a, b, c, m, n, k)
      integer, value :: m, n, k
      real(8) :: a(m,*), b(k,*), c(m,*)
      real(8), shared, dimension(17,16) :: bs
      real(8), device :: cloc(16), ax
      inx = threadidx%x
      iny = threadidx%y
      ibx = (blockidx%x-1) * 16
      iby = (blockidx%y-1) * 16
      ia = ibx + (iny-1)*16 + inx
      ib = inx
      ic = ia
      jb = iby + iny
      jc = iby + 1
      do i = 1, 16
        cloc(i) = 0.0d0
      end do
      do ik = 1, k, 16
        bs(iny,inx) = b(ib,jb)
        call syncthreads()
        do j = 1, 16
          ax = a(ia,ik+j-1)
          do i = 1, 16
            cloc(i) = cloc(i) + ax * bs(i,j)
          end do
        end do
        ib = ib + 16
        call syncthreads()
      end do
      do i = 1, 16
        c(ic,jc+i-1) = cloc(i)
      end do
      call syncthreads()
    end subroutine

This kernel uses a small (padded) square shared memory array to cache a tile of B among the threads in the thread block; each thread reads a value of A from global memory and applies it 16 times in the inner loop. It is a simple implementation and performs reasonably well.

For this article, I've stacked the deck against this kernel and used poorly conditioned data: randomly ordered sets of positive and negative values with magnitudes between 2**(-127) and 2**128 (roughly 10**(+/-38)). With exact arithmetic, the numbers should cancel out and give a result of 0.0d0 for each element of C. Instead, for a set of 512x512 matrices, we see about half the values are nonzero:

    % ./a.out
    errors were encountered
    Max error was E+22
    Ave error was E
    512x512 * 512x512: ms GFlops/s
    C(1,1)= E-039

Compared to the largest number in the dataset, the results are good to 16 digits, but still, they are quite a ways from the 0.0 we want.

Now, let's get back to separate compilation. I put together a very quick port of David Bailey's ddfun90 library to CUDA Fortran. The ddfun90 library performs double-double arithmetic using two real*8 values held in a derived type. The library uses generic interfaces and, once built, can be enabled very quickly in your F90 code. There are certain constructs in the library (notably error handling using Fortran I/O) that I've left for another day. For now, I'm concentrating on double-double add, subtract, multiply, and divide. For these routines, I made changes in the library file ddfun90 for the device entry points, adding attributes(device) to the functions and subroutines I want to call from device code:

    attributes(device) subroutine ddadd (dda, ddb, ddc)

    !   This subroutine computes ddc = dda + ddb.

      implicit none
      real*8 dda(2), ddb(2), ddc(2)
      real*8 e, t1, t2

    !   Compute dda + ddb using Knuth's trick.
      t1 = dda(1) + ddb(1)
      e = t1 - dda(1)
      t2 = ((ddb(1) - e) + (dda(1) - (t1 - e))) + dda(2) + ddb(2)

    !   The result is t1 + t2, after normalization.
      ddc(1) = t1 + t2
      ddc(2) = t2 - (ddc(1) - t1)
      return
    end subroutine

Similarly, in ddmod90, I added the device attribute on the overloaded procedures I wanted to use:

    interface operator (+)
      module procedure dd_addqq
    end interface

    attributes(device) function dd_addqq (qa, qb)
      implicit real*8 (d), &
        type (dd_real) (q), complex (kdb) (x), type (dd_complex) (z)
      type (dd_real) :: dd_addqq
      intent (in) :: qa, qb
      call ddadd (qa%ddr, qb%ddr, dd_addqq%ddr)
      return
    end function

To take advantage of separate compilation with CUDA Fortran, use the rdc flag. This flag, carried over from nvcc, stands for "relocatable device code".

    % pgf90 -c -O2 -Mcuda=rdc ddfun90.cuf ddmod90.cuf

The generated .o files can be used like any other object files, or even put into a library:

    % ar rc ddfunc.a ddfun90.o ddmod90.o

Now, I need to modify my original program to use the extended precision arithmetic. Thanks to the overloaded operators, this is pretty easy. First, use the ddmodule:

    MODULE simple_dgemm
      use ddmodule   ! <- defines entry points, data types, and global data
      CONTAINS       !    for use in contained subprograms

And then change the types of the shared and local variables:

    type(dd_real), shared, dimension(17,16) :: bs
    type(dd_real), device :: cloc(16), ax

That's it. Arithmetic previously done in double precision, such as:

    cloc(i) = cloc(i) + ax * bs(i,j)

is now done in double-double precision due to the type changes. You can build the driver program and link in either the .o files or the .a archive as usual. Just remember to add the rdc option to -Mcuda:

    % pgf90 -O2 -Mcuda=rdc dgemmpaper.cuf ddfun90.o ddmod90.o

or

    % pgf90 -O2 -Mcuda=rdc dgemmpaper.cuf ddfunc.a

As you might suspect, when we run it now the performance is much slower:

    % ./a.out
    Test passed!
    512x512 * 512x512: ms GFlops/s
    C(1,1)=

But look: instead of generating bad answers quickly, we are reliably generating good answers, and in the process we've learned just how easy it is to take advantage of third-party libraries with CUDA 5.0. Of course, the input data could be changed so that even the double-double arithmetic produces incorrect results. Bailey's page has a link to another package, mpfun90, that could probably address most of those issues too. The interfaces in that library are similar in nature to ddfun90.

The rdc flag by itself implies CUDA 5.0; in other words, -Mcuda=rdc is equivalent to -Mcuda=cuda5.0,rdc. Using rdc is not supported in CUDA 4.2. Another CUDA restriction is that separate compilation only works on compute capability 2.0 (Fermi) and above. If you're interested, you can download this entire code (minus ddfun90) from the PGI website.

Dynamic Parallelism of Device Code in CUDA

PGI 2013 also adds support for dynamic parallelism in CUDA Fortran. This means a GPU kernel can launch one or more sub-kernels, and it has access to a limited subset of the CUDA API by which to control them.
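Before looking at the device-side code, it helps to recall why launching sub-kernels for pieces of a matrix product works at all: a matrix multiply can be computed block by block. The following pure-Python sketch (the helper names here are mine, purely illustrative) checks the 2x2 block identity on small integer matrices, so the comparison is exact:

```python
# Pure-Python check of the 2x2 block decomposition of a matrix product:
# each block of C is the sum of two sub-matrix multiplies,
# e.g. C11 = A11*B11 + A12*B21.

def matmul(x, y):
    """Naive dense matrix multiply for lists of lists."""
    n, k, m = len(x), len(y), len(y[0])
    return [[sum(x[i][t] * y[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def madd(x, y):
    """Element-wise matrix addition."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def block(mat, bi, bj, h):
    """Extract the h x h sub-block (bi, bj) of a 2h x 2h matrix."""
    return [row[bj*h:(bj+1)*h] for row in mat[bi*h:(bi+1)*h]]

n, h = 4, 2
A = [[(3*i + j) % 7 for j in range(n)] for i in range(n)]
B = [[(i + 5*j) % 11 for j in range(n)] for i in range(n)]
C = matmul(A, B)

# Rebuild every block of C from the eight sub-matrix multiplies.
for bi in range(2):
    for bj in range(2):
        Cblk = madd(matmul(block(A, bi, 0, h), block(B, 0, bj, h)),
                    matmul(block(A, bi, 1, h), block(B, 1, bj, h)))
        assert Cblk == block(C, bi, bj, h)
print("block decomposition verified")
```

The same identity, applied once at the 512x512 level, is what the dynamic-parallelism example below implements with eight dgemm16 launches and four block additions.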
The original 512x512 matrix multiply above can be broken up into eight sub-matrix multiplies:

    [ C11 C12 ]   [ A11 A12 ]   [ B11 B12 ]
    [ C21 C22 ] = [ A21 A22 ] * [ B21 B22 ]

    C11 = A11*B11 + A12*B21
    C12 = A11*B12 + A12*B22
    C21 = A21*B11 + A22*B21
    C22 = A21*B12 + A22*B22

The code for this decomposition can now all be contained in a CUDA Fortran global subroutine, as long as you are using CUDA 5.0 on a compute capability 3.5 (Kepler K20) card:

    real(8), device, allocatable :: m1(:,:), m2(:,:), m3(:,:), m4(:,:)
    real(8), device, allocatable :: m5(:,:), m6(:,:), m7(:,:), m8(:,:)
    type(dim3), device :: devthreads, devblocks

    newn = n / 2    ! For convenience, now assume square matrices

    allocate(m1(1:newn,1:newn))
    allocate(m2(1:newn,1:newn))
    allocate(m3(1:newn,1:newn))
    allocate(m4(1:newn,1:newn))
    allocate(m5(1:newn,1:newn))
    allocate(m6(1:newn,1:newn))
    allocate(m7(1:newn,1:newn))
    allocate(m8(1:newn,1:newn))

    devblocks = dim3(newn/256, newn/16, 1)
    devthreads = dim3(16, 16, 1)

    call dgemm16<<<devblocks,devthreads>>>(a(1,1), m, &
                                           b(1,1), k, &
                                           m1(1,1), newn, newn, newn, newn)
    call dgemm16<<<devblocks,devthreads>>>(a(1,1+k/2), m, &
                                           b(1+k/2,1), k, &
                                           m2(1,1), newn, newn, newn, newn)
    call dgemm16<<<devblocks,devthreads>>>(a(1,1), m, &
                                           b(1,1+n/2), k, &
                                           m3(1,1), newn, newn, newn, newn)
    call dgemm16<<<devblocks,devthreads>>>(a(1,1+k/2), m, &
                                           b(1+k/2,1+n/2), k, &
                                           m4(1,1), newn, newn, newn, newn)
    call dgemm16<<<devblocks,devthreads>>>(a(1+m/2,1), m, &
                                           b(1,1), k, &
                                           m5(1,1), newn, newn, newn, newn)
    call dgemm16<<<devblocks,devthreads>>>(a(1+m/2,1+k/2), m, &
                                           b(1+k/2,1), k, &
                                           m6(1,1), newn, newn, newn, newn)
    call dgemm16<<<devblocks,devthreads>>>(a(1+m/2,1), m, &
                                           b(1,1+n/2), k, &
                                           m7(1,1), newn, newn, newn, newn)
    call dgemm16<<<devblocks,devthreads>>>(a(1+m/2,1+k/2), m, &
                                           b(1+k/2,1+n/2), k, &
                                           m8(1,1), newn, newn, newn, newn)

    istat = cudadevicesynchronize()

    call add16<<<1,devthreads>>>(m1, newn, m2, newn, c(1,1), m, newn)
    call add16<<<1,devthreads>>>(m3, newn, m4, newn, c(1,1+n/2), m, newn)
    call add16<<<1,devthreads>>>(m5, newn, m6, newn, c(1+m/2,1), m, newn)
    call add16<<<1,devthreads>>>(m7, newn, m8, newn, c(1+m/2,1+n/2), m, newn)

    istat = cudadevicesynchronize()
    deallocate(m1,m2,m3,m4,m5,m6,m7,m8)

Just as in CUDA Fortran host code, the F90 allocate statement in device code wraps the cudaMalloc() calls and F90 deallocate wraps cudaFree(). The entire set of CUDA device API calls is available in CUDA Fortran. This code is compiled using these flags:

    % pgf90 -Mcuda=cuda5.0,cc35,rdc dgemmdynamic.cuf

This version doesn't perform very well because all of the launches are serialized along the same stream. CUDA 5.0 adds a call to create streams from device code. Use it like this:

    integer :: flags
    integer(kind=cuda_stream_kind) :: istreams(8)

    flags = cudastreamnonblocking
    do i = 1, 8
      istat = cudastreamcreatewithflags(istreams(i), flags)
    end do

    call dgemm16<<<devblocks,devthreads,0,istreams(1)>>>(a(1,1), m, &
                                           b(1,1), k, &
                                           m1(1,1), newn, newn, newn, newn)

Now the performance you'll see from this version is comparable to the original. Note that you really shouldn't expect a speedup from launching kernels on the device relative to launching them from the host. Dynamic parallelism is intended to ease programming; it is not a performance optimization in itself. That said, there is still room for improvement here. As a further optimization, you can allocate the new matrices and create the streams just once. Typically, in CUDA Fortran you move the declarations out of the global subroutine to the module level to ensure their state is saved across multiple kernel invocations.

I've also put together one level of the Strassen matrix multiply algorithm using basically the same techniques as those shown above. The Strassen algorithm uses one fewer sub-matrix multiply at the cost of more additions. All three versions of the code are included in my example tar file.

Dynamic Parallelism through CUBLAS

One easy way to take advantage of dynamic parallelism is through NVIDIA's newly available device libraries. The first such library available from NVIDIA is the cublas library, and PGI provides device code wrappers for it just as we have done in the past for the host-called cublas libraries.
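For reference, a cublas dgemm call computes C = alpha*A*B + beta*C; with alpha = 1 and beta = 0 that reduces to a plain matrix multiply, which is how the kernel below uses it. Here is a small pure-Python sketch of those semantics (illustrative only, not PGI's wrapper interface, and row-major rather than Fortran column-major):

```python
# Sketch of the GEMM operation C = alpha*A*B + beta*C for
# lists-of-lists matrices; cublas applies the same formula on the GPU.

def gemm(alpha, A, B, beta, C):
    """Return alpha*A*B + beta*C without modifying the inputs."""
    m, k, n = len(A), len(B), len(B[0])
    return [[alpha * sum(A[i][t] * B[t][j] for t in range(k)) + beta * C[i][j]
             for j in range(n)] for i in range(m)]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[100.0, 100.0], [100.0, 100.0]]

# alpha=1, beta=0: the existing C is ignored; the result is just A*B.
assert gemm(1.0, A, B, 0.0, C) == [[19.0, 22.0], [43.0, 50.0]]
# beta=1: the product is accumulated into the existing C.
assert gemm(1.0, A, B, 1.0, C) == [[119.0, 122.0], [143.0, 150.0]]
```

The beta=1 form is also what lets a blocked algorithm accumulate partial products directly into C instead of using separate addition kernels.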
To create an equivalent dgemm routine to the previous examples, you could code a single-threaded global subroutine like this:

    attributes(global) subroutine dgemm16(a, b, c, m, n, k)
      use cublas_device
      integer, value :: m, n, k
      double precision, device :: a(m,*), b(k,*), c(m,*)
      double precision, device :: alpha, beta
      type(cublashandle) :: ch1
      integer transa, transb
      i = threadidx%x
      if (i.eq.1) then
        istat = cublascreate_v2(ch1)
        alpha = 1.0d0
        beta = 0.0d0
        transa = cublas_op_n
        transb = cublas_op_n
        istat = cublasdgemm_v2(ch1, transa, transb, m, n, k, alpha, &
                               a, m, b, k, beta, c, m)
        istat = cublasdestroy_v2(ch1)
      end if
      return
    end subroutine

Targeting New CUDA Runtime and Hardware with CUBLAS and PGI Compilers

Make sure you compile correctly for your new target architectures. For now, the default CUDA version with PGI 13.x is CUDA 4.2:

    bash-4.1$ pgf90 -O2 dgemmhostcublas.cuf -lcublas
    bash-4.1$ ./a.out
    errors were encountered
    Max error was E+22
    Ave error was E
    512x512 * 512x512: ms GFlops/s
    C(1,1)= E-039

You can always use the latest supported CUDA version by specifying it explicitly on the command line:

    bash-4.1$ pgf90 -O2 -Mcuda=cuda5.0,cc35 dgemmhostcublas.cuf -lcublas
    bash-4.1$ ./a.out
    errors were encountered
    Max error was E+22
    Ave error was E
    512x512 * 512x512: ms GFlops/s
    C(1,1)= E-039

I'm certain you can top 582 GFlops/s by making the problem bigger, but I'll leave that as an exercise for the reader. I probably didn't compute that many flops total in the first 10 years of my career, so I'm not that greedy.

Final Remarks

There are potentially many more cool opportunities to take advantage of with CUDA Fortran and CUDA 5.0, more than I can show here. PGI will be showing these demos and more at GTC this month. Come by and chat.
More informationIntroduction to CUDA 5.0
Introduction to CUDA 5.0 CUDA 5 In this article, I will introduce the reader to CUDA 5.0. I will briefly talk about the architecture of the Kepler GPU (Graphics Processing Unit) and I will show you how
More informationCS 179: GPU Computing. Recitation 2: Synchronization, Shared memory, Matrix Transpose
CS 179: GPU Computing Recitation 2: Synchronization, Shared memory, Matrix Transpose Synchronization Ideal case for parallelism: no resources shared between threads no communication between threads Many
More informationLecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1
Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei
More informationNAG Fortran Library Routine Document F01CWF.1
NAG Fortran Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised terms and other implementation-dependent
More informationUnrolling parallel loops
Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:
More informationTutorial: Parallel programming technologies on hybrid architectures HybriLIT Team
Tutorial: Parallel programming technologies on hybrid architectures HybriLIT Team Laboratory of Information Technologies Joint Institute for Nuclear Research The Helmholtz International Summer School Lattice
More informationAllocating Storage for 1-Dimensional Arrays
Allocating Storage for 1-Dimensional Arrays Recall that if we know beforehand what size we want an array to be, then we allocate storage in the declaration statement, e.g., real, dimension (100 ) :: temperatures
More informationBlocks, Grids, and Shared Memory
Blocks, Grids, and Shared Memory GPU Course, Fall 2012 Last week: ax+b Homework Threads, Blocks, Grids CUDA threads are organized into blocks Threads operate in SIMD(ish) manner -- each executing same
More informationPorting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation
Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA NVIDIA Corporation Outline! Overview of CG benchmark! Overview of CUDA Libraries! CUSPARSE! CUBLAS! Porting Sequence! Algorithm Analysis! Data/Code
More informationHands-on CUDA Optimization. CUDA Workshop
Hands-on CUDA Optimization CUDA Workshop Exercise Today we have a progressive exercise The exercise is broken into 5 steps If you get lost you can always catch up by grabbing the corresponding directory
More informationGraph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM.
Graph Partitioning Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM. Partition given graph G=(V,E) in k subgraphs of nearly equal
More informationInformation Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY
Introduction to CUDA Ingemar Ragnemalm Information Coding, ISY This lecture: Programming model and language Introduction to memory spaces and memory access Shared memory Matrix multiplication example Lecture
More informationCUDA FORTRAN PROGRAMMING GUIDE AND REFERENCE. Version 2017
CUDA FORTRAN PROGRAMMING GUIDE AND REFERENCE Version 2017 TABLE OF CONTENTS Preface...viii Intended Audience... viii Organization... viii Conventions...viii Terminology... ix Related Publications... ix
More informationIntroduction to Multicore Programming
Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Automatic Parallelization and OpenMP 3 GPGPU 2 Multithreaded Programming
More informationSolving Dense Linear Systems on Graphics Processors
Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad
More informationUsing a GPU in InSAR processing to improve performance
Using a GPU in InSAR processing to improve performance Rob Mellors, ALOS PI 152 San Diego State University David Sandwell University of California, San Diego What is a GPU? (Graphic Processor Unit) A graphics
More informationINTRODUCTION TO GPU COMPUTING WITH CUDA. Topi Siro
INTRODUCTION TO GPU COMPUTING WITH CUDA Topi Siro 19.10.2015 OUTLINE PART I - Tue 20.10 10-12 What is GPU computing? What is CUDA? Running GPU jobs on Triton PART II - Thu 22.10 10-12 Using libraries Different
More informationCUDA Fortran Programming Guide and Reference
CUDA Fortran Programming Guide and Reference Version 2014 PGI Compilers and Tools TABLE OF CONTENTS Preface... viii Intended Audience...viii Organization... viii Conventions... viii Terminology...ix Related
More informationECE 574 Cluster Computing Lecture 17
ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux
More informationMathematical computations with GPUs
Master Educational Program Information technology in applications Mathematical computations with GPUs Using GPUs for mathematical problems in Fortran, Java and C# Alexey A. Romanenko arom@ccfit.nsu.ru
More informationBasics of CADA Programming - CUDA 4.0 and newer
Basics of CADA Programming - CUDA 4.0 and newer Feb 19, 2013 Outline CUDA basics Extension of C Single GPU programming Single node multi-gpus programing A brief introduction on the tools Jacket CUDA FORTRAN
More informationModule 2: Introduction to CUDA C. Objective
ECE 8823A GPU Architectures Module 2: Introduction to CUDA C 1 Objective To understand the major elements of a CUDA program Introduce the basic constructs of the programming model Illustrate the preceding
More information