Blocks, Grids, and Shared Memory

Blocks, Grids, and Shared Memory. GPU Course, Fall 2012.

Last week: ax+b Homework

Threads, Blocks, Grids. CUDA threads are organized into blocks. Threads operate in a SIMD(ish) manner, each executing the same instructions in lockstep; the only difference is their thread IDs. A kernel can be launched as a grid of multiple blocks. [Figure: a CUDA thread, a block of CUDA threads, a grid of CUDA blocks.]

CUDA - H/W mapping. Blocks are assigned to a particular SM and executed there one warp (typically 32 threads) at a time. Multiple blocks may be resident on an SM concurrently. Good: latency hiding. Bad: SM resources must be divided between the blocks. If you use only one block, you use only one SM. [Figure: GPU with SM#1, SM#2.]

Multi-block y=ax+b. Break the input and output vectors into blocks. Within each block, the thread index specifies which item to work on. Each thread does one update, y[i] = a*x[i]+b, and puts the result in y[i]. [Figure: vectors x and y split into blocks.]

Multi-block y=ax+b: block-saxpb.cu. [Figure: each block of x maps to the corresponding block of y via y[i] = a*x[i]+b.]

More blocks means more SMs in use, and more FLOPs. On newer cards, where we can use 1024 threads/block, we time multiple calculations so that the timing is not dominated by the memory copy. [Figure: GPU with SM#1, SM#2.]

Multi-block y=ax+b: the thread index gives the index within the block (0..blocksize-1). [Figure: x and y vectors, y[i] = a*x[i]+b.]

Multi-block y=ax+b: the block index (0..nblocks-1) and the size of the block (blocksize) pick out which chunk of the vectors this block handles. [Figure: x and y vectors, y[i] = a*x[i]+b.]

Multi-block y=ax+b, worked example. With blocksize = 100, thread 10 of block 2 computes i = 10 + 2*100 = 210, so it updates yd[210] = a*xd[210] + b.
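A minimal sketch of such a multi-block kernel, assuming device arrays xd and yd and a length n (the names are illustrative, not necessarily those in block-saxpb.cu):

// Multi-block y = a*x + b: each thread handles one element.
// The global index combines block index, block size, and thread index.
__global__ void saxpb(int n, float a, const float *xd, float b, float *yd) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)                       // guard: the last block may be partly empty
        yd[i] = a * xd[i] + b;
}

// Launch with enough blocks to cover n elements:
//   int blocksize = 256;
//   int nblocks = (n + blocksize - 1) / blocksize;
//   saxpb<<<nblocks, blocksize>>>(n, a, xd, b, yd);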

How many threads/block? It should be an integral multiple of the warp size (32), and no more than the maximum allowed by the scheduling hardware. You can get that last number from the hardware specs, but what if the code will be needed on several machines? The API can return it:

cudaGetDeviceProperties: querydevs.cu
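A minimal sketch of querying the per-block limits with cudaGetDeviceProperties (this is an assumption about what querydevs.cu does, not a copy of it):

#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);   // query device 0
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("Device 0: %s\n", prop.name);
    printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}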

cudaGetDeviceProperties. All CUDA calls return cudaSuccess on successful completion. The GPU hardware does not try very hard to catch errors or notify you, so testing return codes is important! It is common to see simple automation like this wrapping all CUDA calls; that is the bare minimum for sensible operation. Test early, fail often.
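The wrapper shown on the slide is not reproduced here; a typical one looks roughly like the following sketch (the macro name CUDA_CHECK is illustrative):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; abort with file and line on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err__ = (call);                                   \
        if (err__ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err__), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&xd, n * sizeof(float)));
//   CUDA_CHECK(cudaMemcpy(xd, x, n * sizeof(float), cudaMemcpyHostToDevice));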

Why the .x's? For convenience, CUDA allows thread and block indices to be multidimensional. Thread blocks can be 3-dimensional (512, 512, 64); grids of blocks can be 2-dimensional (64k, 64k, 1). These variables are of type dim3 or uint3. CUDA also has int1, int2, int3, int4, float1, float2, float3, float4, etc.

Why the .x's? threadIdx.{x,y,z} - thread index. blockDim.{x,y,z} - size of block (number of threads in each dimension). blockIdx.{x,y,z} - block index. gridDim.{x,y,z} - size of grid (number of blocks in each dimension). warpSize - size of a warp (int).
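For example, inside a kernel launched with a 2D grid of 2D blocks, a global (row, column) index can be built from these variables (a generic sketch, not taken from the course code):

// Global 2D index from the built-in variables.
__global__ void add_one_2d(int nrows, int ncols, float *data) {
    int col = threadIdx.x + blockIdx.x * blockDim.x;   // global x index
    int row = threadIdx.y + blockIdx.y * blockDim.y;   // global y index
    if (row < nrows && col < ncols)
        data[row * ncols + col] += 1.0f;
}

// Host side: dim3 describes multidimensional launch shapes.
//   dim3 threads(16, 16);
//   dim3 blocks((ncols + 15) / 16, (nrows + 15) / 16);
//   add_one_2d<<<blocks, threads>>>(nrows, ncols, data);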

Function qualifiers: __global__ - device code that can be seen (invoked) from the host. __host__ - the default; not usually interesting. __device__ - device code, callable only from other device code. __host__ __device__ - compiled for both host and device.
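A small illustration of the qualifiers (generic, not from the course code):

__device__ float sq(float x) { return x * x; }           // device-only helper

__host__ __device__ float lerp(float a, float b, float t) {
    return a + t * (b - a);                               // usable on both sides
}

__global__ void square_kernel(int n, float *y) {          // callable from host
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) y[i] = sq(y[i]);
}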

Compilation process: nvcc compiles the .cu file into host object code plus intermediate, device-independent PTX code; a second compilation stage turns the PTX into device code, and everything is linked into the executable.

Restrictions: __global__ functions can't recurse, and neither can __device__ functions on non-Fermi hardware. There are no function pointers to __device__ functions on non-Fermi hardware; you can't take the address of a __device__ function. You can't have static variables in __global__ or __device__ functions, and you can't use varargs in device code.

2-Dimensional Blocks. Using 2D/3D thread blocks, or 2D grids, is never strictly necessary... but it can make code clearer and shorter. Matrix multiplication: C_{i,j} = Σ_k A_{i,k} B_{k,j}. [Figure: C = A * B.]

2-Dimensional Blocks: matmult.cu implements C_{i,j} = Σ_k A_{i,k} B_{k,j} with a 2D layout of threads.

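A sketch of the kind of kernel used at this stage, with one thread per element of C and 2D blocks (the array names and the square-matrix, row-major assumptions are mine, not necessarily the handout's):

// Naive matrix multiplication: each thread computes one C[i][j].
__global__ void matmult_naive(int n, const float *A, const float *B, float *C) {
    int j = threadIdx.x + blockIdx.x * blockDim.x;   // column of C
    int i = threadIdx.y + blockIdx.y * blockDim.y;   // row of C
    if (i < n && j < n) {
        float sum = 0.0f;                            // or double, for the
        for (int k = 0; k < n; k++)                  // double-precision sum
            sum += A[i*n + k] * B[k*n + j];
        C[i*n + j] = sum;
    }
}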

Timings:

Orig:
$ ./matmult --matsize=160 --nblocks=10
Matrix size = 160, Number of blocks = 10.
CPU time = 14.093 millisec.
GPU time = 4.416 millisec.
CUDA and CPU results differ by 0.162872

Double prec. sum:
$ ./matmult --matsize=160 --nblocks=10
Matrix size = 160, Number of blocks = 10.
CPU time = 14.047 millisec.
GPU time = 2.219 millisec.
CUDA and CPU results differ by 0.000000

Faster, even with double-precision sums - why?

CUDA Memories. All HPC, but especially GPU computing, is about planning memory access to be fast. Global memory is off the GPU chip (but on the card), with roughly 100-cycle latency. Thread-local variables get put into registers on each SM: fast (~1 cycle) but small. [Figure: global memory on the card; registers on chip in SM#1, SM#2.]

CUDA Memories:

Memory     On chip?  Cached?  R/W    Scope
Register   On        No       R/W    Thread
Shared     On        No       R/W    Block
Global     Off       No       R/W    Kernel, Host
Constant   Off       Yes      R      Kernel, Host
Texture    Off       Yes      R(W?)  Kernel, Host
Local *    Off       No       R/W    Thread

* If you run out of registers, local memory is placed in global memory.

Memory usage in SGEMM. How can we exploit this? There are N^3 multiplies and adds on 2N^2 data, with regular access: an opportunity for high memory re-use. We need to find ways to bring data into shared memory (incurring the global-memory overhead once) and use it several times. C_{i,j} = Σ_k A_{i,k} B_{k,j}.

Memory usage in SGEMM. One nice thing about matrix multiplication: it is the same as block multiplication, where each sub-block product is itself a matrix multiply. Neighbouring threads within a block all see nearby rows and columns, so pull the whole block in. With b blocks in each dimension, each piece of data is pulled in only 2b times, not 2N times. C_{bi,bj} = Σ_{bk} A_{bi,bk} B_{bk,bj}.

__syncthreads(). The computation must wait until all threads have brought in their data; not all memory accesses take the same length of time. __syncthreads() waits until all threads in the block have reached the same point. There is no equivalent between blocks. The loop must similarly wait for the computation to finish before loading the next tile.

__shared__ arrays. If declared in device code, they must be sized at compile time; there is no shared-memory malloc (all threads in the block would have to agree). You can use consts or #defines to size the array, but we want to maintain flexibility. [Figure: global memory on the card; shared memory on chip in SM#1, SM#2.]
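A sketch of the tiled approach with statically sized shared arrays and __syncthreads() (illustrative, not the handout's exact kernel; it assumes n is a multiple of TILE and that the block is launched as TILE x TILE threads):

#define TILE 16   // compile-time tile edge; block is TILE x TILE threads

__global__ void matmult_tiled(int n, const float *A, const float *B, float *C) {
    __shared__ float As[TILE][TILE];   // one tile of A, shared by the block
    __shared__ float Bs[TILE][TILE];   // one tile of B

    int j = threadIdx.x + blockIdx.x * TILE;   // column of C
    int i = threadIdx.y + blockIdx.y * TILE;   // row of C
    float sum = 0.0f;

    for (int t = 0; t < n / TILE; t++) {
        // Each thread pulls in one element of each tile from global memory.
        As[threadIdx.y][threadIdx.x] = A[i*n + (t*TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t*TILE + threadIdx.y)*n + j];
        __syncthreads();                       // wait: tiles fully loaded

        for (int k = 0; k < TILE; k++)         // use each loaded value TILE times
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                       // wait: done with these tiles
    }
    C[i*n + j] = sum;
}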

extern __shared__. An optional 3rd launch-configuration argument gives the size (in bytes) of shared memory to allocate per block.

extern __shared__. It comes in as one array; you can type it and name it anything you like.

extern __shared__. If you want to use it for two things, you have to deal with that yourself.
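A sketch of the extern __shared__ pattern, including splitting one dynamically sized buffer into two arrays by hand (names are illustrative; assumes the block has at least m threads):

__global__ void two_buffers(int m, const float *in, float *out) {
    extern __shared__ float scratch[];       // size set at launch time
    float *a = scratch;                      // first m floats
    float *b = scratch + m;                  // next m floats: split by hand
    int t = threadIdx.x;
    if (t < m) { a[t] = in[t]; b[t] = 2.0f * in[t]; }
    __syncthreads();                         // all writes visible to the block
    if (t < m) out[t] = a[t] + b[m - 1 - t];
}

// Launch: the third configuration argument is the shared bytes per block.
//   two_buffers<<<nblocks, blocksize, 2 * m * sizeof(float)>>>(m, in, out);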

Timings (tpb2):

Orig:
$ ./matmult --matsize=160 --nblocks=10
Matrix size = 160, Number of blocks = 10.
CPU time = 14.093 millisec.
GPU time = 4.416 millisec.
CUDA and CPU results differ by 0.162872

Double prec. sum:
$ ./matmult --matsize=160 --nblocks=10
Matrix size = 160, Number of blocks = 10.
CPU time = 14.047 millisec.
GPU time = 2.219 millisec.
CUDA and CPU results differ by 0.000000

Shared:
$ ./matmult
Matrix size = 160, Number of blocks = 10.
CPU time = 14.041 millisec.
GPU time = 0.998 millisec.
CUDA and CPU results differ by 0.000000

Homework. Using matmult.cu as a template, look at smoothimage.c in the code I'll send out; it implements image convolution. Implement a CUDA version using shared memory, and make sure it gets the same answer as the CPU version. How does data reuse vary as a function of the halo size?