Technische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics


GPU Programming Rüdiger Westermann Chair for Computer Graphics & Visualization Faculty of Informatics

Overview: programming interfaces and support libraries; the CUDA programming abstraction; an in-depth CUDA example; CUDA-OpenGL binding.

GPU programming How can we program the GPU? GPU computing applications can target the GPU through several interfaces: NVIDIA CUDA C/C++ extensions, OpenCL, OpenGL Compute Shaders, DirectX Compute Shaders (DirectCompute), AMD Stream, as well as Fortran, Java, and Python bindings. We will focus on C/CUDA and NVIDIA GPUs in this tutorial.

GPU programming Horizontal vs. vertical application development. On the CPU, applications build on higher-level libraries, algorithm/data-structure libraries, programming system primitives, and hardware, which enables broad code reuse. Historically, each GPU application was built directly on the programming system primitives and the hardware, with little code reuse between applications. Today, domain-specific primitive libraries sit between the applications and the programming system primitives, restoring code reuse. (Figure: layer diagrams for the CPU, the GPU historically, and the GPU today.)

GPU programming Software libraries to support GPU/CUDA programming. GPU computing applications target the CUDA compute architecture through a development environment (C, C++, Fortran, Python, Java, OpenCL, DirectCompute), utility libraries (CUDPP, CUBLAS, CUFFT, CULA, NVPP, MAGMA), and application acceleration engines (SceniX, CompleX, OptiX, PhysX). (Figure: the GPU computing software stack.)

GPU programming A few software libraries to support GPU/CUDA programming:
CUDPP: a library of high-performance parallel primitives for GPUs. Provides scan primitives, stream compaction, radix sort, sparse matrix-vector multiply, and random number generation.
CUBLAS: Basic Linear Algebra Subprograms such as vector, vector/matrix, and matrix/matrix operations (a subset of BLAS 1/2/3).
CUFFT (http://developer.download.nvidia.com): 1D, 2D, and 3D transforms of complex- and real-valued data. Transform sizes (2D/3D) up to 16384 in each dimension.
MAGMA (http://icl.cs.utk.edu/magma/): dense linear algebra on heterogeneous CPU+GPU systems.
CULA by EM Photonics (http://www.culatools.com/): LAPACK functionality on GPUs, including LU, QR, and singular value decomposition, and least squares.
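As a usage sketch for one of these libraries, the following hedged example calls CUBLAS (v2 API) to compute y = alpha*x + y on device arrays. The array size n is an illustrative assumption, initialization and error checking are omitted, and the program must be linked with -lcublas; this is not part of the slides.

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main(void)
    {
        const int n = 1024;       // illustrative size, not from the slides
        const float alpha = 2.0f;
        float *x_d, *y_d;

        // allocate device arrays (filling them with data is omitted here)
        cudaMalloc((void **) &x_d, sizeof(float) * n);
        cudaMalloc((void **) &y_d, sizeof(float) * n);

        cublasHandle_t handle;
        cublasCreate(&handle);

        // y = alpha * x + y (BLAS level-1 SAXPY), executed on the device
        cublasSaxpy(handle, n, &alpha, x_d, 1, y_d, 1);

        cublasDestroy(handle);
        cudaFree(x_d);
        cudaFree(y_d);
        return 0;
    }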

CUDA Compute Unified Device Architecture The CUDA parallel programming abstraction: a parallel computing architecture and programming model, i.e., a unified hardware and software specification for parallel computing. Hardware multithreading with a general-purpose programming model: the user launches batches of threads on the GPU (an application-controlled SIMD program structure) and has a fully general load/store memory model. CUDA is a simple extension to standard C. It is not a graphics API, but graphics API interoperability is possible via buffer exchange between CUDA and the graphics API.

CUDA Compute Unified Device Architecture CUDA is designed to fully exploit the SIMD execution paradigm underlying the GPU design. Parallel kernels are composed of many threads, and all threads execute the same sequential program. Threads are grouped into blocks; threads within a block can cooperate via fast shared memory and hardware synchronization barriers. Blocks are virtualized multiprocessors: all blocks must be independent of each other, and there is an implicit barrier between kernel launches. This model is supported by the runtime API: cudaMalloc(), cudaMemcpy(), cudaFree(), cudaGetLastError(). A minimal kernel sketch follows below.
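To make the kernel/block/thread vocabulary concrete before the stencil example, here is a minimal hedged sketch (not from the slides) in which every thread processes one element of a vector; the kernel name addVectors, the block size of 256, and the size n are illustrative assumptions.

    #include <cuda_runtime.h>

    // each thread adds one pair of elements; threads are grouped into blocks
    __global__ void addVectors(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
        if (i < n)                                     // guard the last, partial block
            c[i] = a[i] + b[i];
    }

    // launch: enough blocks of 256 threads to cover all n elements, e.g.
    //   addVectors<<<(n + 255) / 256, 256>>>(a_d, b_d, c_d, n);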

CUDA Compiling CUDA for GPUs: NVCC separates the host and device parts of a source file and compiles them into CPU code and Parallel Thread Execution (PTX) code for the GPU.
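As an illustrative build step (assuming a standard CUDA toolkit installation; the file name stencil.cu is an assumption, not from the slides), the example developed below could be compiled with: nvcc -O2 -o stencil stencil.cu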

Evaluation of a finite-difference stencil on a 2D uniform grid:

\Delta u_{ij} \approx \frac{u_{i+1,j} - 2u_{ij} + u_{i-1,j}}{\Delta x^2} + \frac{u_{i,j+1} - 2u_{ij} + u_{i,j-1}}{\Delta y^2} = u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4u_{ij} \quad \text{for } \Delta x = \Delta y = 1

(Figure: the five-point stencil on the 2D grid, involving u_{ij} and its four direct neighbors u_{i\pm 1,j} and u_{i,j\pm 1}.)

The sequential CPU code for computing the finite-difference stencil

    int main(float *u, int Nx, int Ny)
    {
        // pointer to CPU (host) memory
        float *out;

        // allocate array on host
        out = (float *)malloc(sizeof(float) * Nx * Ny);

        // compute stencil
        for (int j = 1; j < Ny-1; j++) {
            for (int i = 1; i < Nx-1; i++) {
                out[i + j * Nx] = u[i+1 + j * Nx] + u[i-1 + j * Nx]
                                + u[i + (j+1) * Nx] + u[i + (j-1) * Nx]
                                - 4 * u[i + j * Nx];
            }
        }

        // copy result to input array and free memory
        memcpy(u, out, sizeof(float) * Nx * Ny);
        free(out);
    }

The parallel GPU code for computing the finite-difference stencil. Requirements: allocation of memory on the GPU device and moving data to/from that memory; a kernel, i.e., a function callable from the host (CPU) and executed on the device by many threads in parallel; an execution configuration to specify the number and grouping of the parallel threads used to execute the kernel; and performing parallel computations based on a subdivision of the problem domain, i.e., a means to specify which part of the domain a thread has to work on.

CUDA setup code for the stencil computation

    #include <stdio.h>
    #include <cuda.h>

    int main(float *u, int Nx, int Ny)
    {
        // pointers to GPU (device) memory
        float *u_d, *out;

        // allocate arrays on the GPU device; cudaMalloc returns a pointer
        // to a linear memory segment in global device memory
        cudaMalloc((void **) &u_d, sizeof(float) * Nx * Ny);
        cudaMalloc((void **) &out, sizeof(float) * Nx * Ny);

Moving data between host and device memory

        // send data from host to device: u to u_d; cudaMemcpy copies the
        // content of a CPU array to a device array, and vice versa
        cudaMemcpy(u_d, u, sizeof(float) * Nx * Ny, cudaMemcpyHostToDevice);

        // here:
        // CUDA configuration and execution of the parallel program

        // retrieve data from device: out to u
        cudaMemcpy(u, out, sizeof(float) * Nx * Ny, cudaMemcpyDeviceToHost);

        // cleanup
        cudaFree(u_d);
        cudaFree(out);
    } // end of main
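The slides omit error handling for brevity. In practice, each runtime call returns a cudaError_t that should be checked; a minimal hedged sketch (the macro name CUDA_CHECK is an assumption, not part of the slides; it requires <stdio.h> and <stdlib.h>):

    #define CUDA_CHECK(call)                                          \
        do {                                                          \
            cudaError_t err = (call);                                 \
            if (err != cudaSuccess) {                                 \
                fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                        cudaGetErrorString(err), __FILE__, __LINE__); \
                exit(EXIT_FAILURE);                                   \
            }                                                         \
        } while (0)

    // usage:
    // CUDA_CHECK(cudaMemcpy(u_d, u, sizeof(float) * Nx * Ny, cudaMemcpyHostToDevice));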

CUDA configuration and execution of the parallel program. Remember: threads are grouped into blocks and blocks are structured into grids; blocks and grids can have multiple dimensions. Threads in one block are executed on the same SM, whereas blocks can run on different SMs. The block/grid abstraction allows balancing the trade-off between running many independent threads in parallel and running blocks of threads that are synchronized and can cooperate with each other. The execution configuration can be queried by a thread via built-in variables: blockIdx, the thread's block index within the grid; threadIdx, the thread's index within the block; and blockDim, the number of threads in the thread's block.

CUDA configuration and execution of the parallel program. Remember: threads are grouped into blocks and blocks are structured into grids; blocks and grids can have multiple dimensions.

    // compute (2D) execution configuration
    const int BLOCKSIZEX = 8; // number of threads per block along dim 1
    const int BLOCKSIZEY = 8; // number of threads per block along dim 2

    // each block computes a (BLOCKSIZEX-2) x (BLOCKSIZEY-2) interior tile
    int nBlocksX = (Nx - 2) / (BLOCKSIZEX - 2); // how many blocks along dim 1
    int nBlocksY = (Ny - 2) / (BLOCKSIZEY - 2); // how many blocks along dim 2

    dim3 dimBlock(BLOCKSIZEX, BLOCKSIZEY); // set values
    dim3 dimGrid(nBlocksX, nBlocksY);

    // call the computeStencilOnDevice kernel with these input arguments,
    // using the specified configuration
    computeStencilOnDevice <<< dimGrid, dimBlock >>> (u_d, out, Nx, Ny);
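Note that the integer division only covers the domain exactly when Nx-2 and Ny-2 are multiples of BLOCKSIZEX-2 and BLOCKSIZEY-2. A common hedged generalization (an assumption, not on the slides) rounds the block count up and lets out-of-range threads idle:

    // ceiling division: ceil(a/b) = (a + b - 1) / b with a = Nx-2, b = BLOCKSIZEX-2
    int nBlocksX = (Nx - 2 + BLOCKSIZEX - 3) / (BLOCKSIZEX - 2);
    int nBlocksY = (Ny - 2 + BLOCKSIZEY - 3) / (BLOCKSIZEY - 2);
    // the kernel must then additionally test i < Nx-1 and j < Ny-1
    // before reading from or writing to the arrays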

The kernel: the program that is executed by the threads on the SPs

    // BLOCKSIZEX and BLOCKSIZEY are assumed to be compile-time constants
    // (e.g., #defines) visible to the kernel
    __global__ void computeStencilOnDevice(float *u, float *out, int Nx, int Ny)
    {
        // the thread's index in the current block
        int tx = threadIdx.x;
        int ty = threadIdx.y;

        // the grid index loaded by the first thread of the current block,
        // i.e., the base of the block's subgrid (including the one-point halo)
        // in the global grid
        int baseX = blockIdx.x * (BLOCKSIZEX - 2);
        int baseY = blockIdx.y * (BLOCKSIZEY - 2);

        // the grid index at which the numerical stencil
        // is to be evaluated by the thread
        int i = baseX + tx;
        int j = baseY + ty;

The kernel: the program that is executed by the threads on the SPs

        // allocate fast shared memory (lifetime of a block) to cache the working
        // set; this reduces the number of device memory reads by a factor of 4
        __shared__ float u_sh[BLOCKSIZEX][BLOCKSIZEY];

        // fill the shared memory with the working set from the global device array
        u_sh[tx][ty] = u[i + j * Nx];
        __syncthreads(); // all threads wait at this barrier for all others

        if (tx > 0 && tx < BLOCKSIZEX-1 && ty > 0 && ty < BLOCKSIZEY-1) {
            // compute the stencil on the data in shared memory
            // and write to the out array
            out[i + j * Nx] = u_sh[tx+1][ty] + u_sh[tx-1][ty]
                            + u_sh[tx][ty+1] + u_sh[tx][ty-1]
                            - 4.0f * u_sh[tx][ty];
        }
    } // end of computeStencilOnDevice
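For comparison, here is a hedged sketch of the same stencil without shared memory (the kernel name computeStencilSimple is an assumption, not part of the slides). Every thread reads its four neighbors directly from global memory, so each interior value is fetched several times, which is exactly the redundancy the shared-memory version above avoids:

    __global__ void computeStencilSimple(float *u, float *out, int Nx, int Ny)
    {
        // one thread per grid point, using the built-in block dimensions
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;

        if (i > 0 && i < Nx-1 && j > 0 && j < Ny-1) {
            out[i + j * Nx] = u[i+1 + j * Nx] + u[i-1 + j * Nx]
                            + u[i + (j+1) * Nx] + u[i + (j-1) * Nx]
                            - 4.0f * u[i + j * Nx];
        }
    }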

Parallel memory transactions Assume threads are ordered in row-major order, e.g., in a 2x2 block: [0][0] -> 0, [1][0] -> 1, [0][1] -> 2, [1][1] -> 3. The device data is laid out in this way, too. The per-thread memory reads due to u_sh[tx][ty] = u[i + j * Nx]; should then address consecutive locations within each row of the data grid. (Figure: the data grid with consecutively numbered elements and one block of threads highlighted.)

Non-coalesced parallel memory transactions Every single read from device memory fetches buckets of at least 32 bytes: one thread, even though it requests only 4 bytes when reading one float, triggers the movement of 32 bytes. If each of the 8 threads in a row triggered its own transaction, this would result in 8 x 32 bytes being read overall; moreover, since all threads would read from the same memory bank, the read operations would be serialized. In our case, since all requested data lies in succession, all 8 single reads are coalesced into one memory transaction of 8 x 4 = 32 bytes, i.e., 32 bytes in a single read operation.
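A hedged sketch contrasting the two access patterns (the kernel names and the stride parameter are illustrative assumptions, not from the slides):

    // neighboring threads read neighboring addresses: the per-warp reads
    // coalesce into few wide memory transactions
    __global__ void copyCoalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // stride > 1 spreads the reads of neighboring threads over many memory
    // segments, so each one triggers its own transaction
    __global__ void copyStrided(const float *in, float *out, int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n)
            out[i] = in[i];
    }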