Introduction to CUDA (1 of n*)


Agenda
Introduction to CUDA (1 of n*): GPU architecture review, then CUDA. First of two or three dedicated classes.
Joseph Kider, University of Pennsylvania, CIS 565 - Spring 2011
* Where n is 2 or 3

Acknowledgements
Many slides are from Kayvon Fatahalian's "From Shader Code to a Teraflop: How GPU Shader Cores Work": http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuarchteraflop_bps_siggraph2010.pdf
David Kirk and Wen-mei Hwu's UIUC course: http://courses.engr.illinois.edu/ece498/al/

GPU Architecture Review
GPUs are: parallel, multithreaded, many-core. GPUs have: tremendous computational horsepower and high memory bandwidth.

GPU Architecture Review
GPUs are specialized for compute-intensive, highly parallel computation (graphics!). Transistors are devoted to processing, not to data caching and flow control.

GPU Architecture Review: Transistor Usage
Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/cuda_c_programming_guide.pdf

Slide from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuarchteraflop_bps_siggraph2010.pdf (several consecutive slides reproduced from this talk)

Slide from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuarchteraflop_bps_siggraph2010.pdf (several more slides reproduced from this talk)

Threading Hardware in G80

Sources
Slides by ECE 498 AL: Programming Massively Parallel Processors (Wen-mei Hwu); John Nickolls, NVIDIA.

Graphics pipeline evolution (three diagrams)
Fixed-function pipeline: 3D Application or Game sends 3D API commands to the 3D API (OpenGL or Direct3D); across the CPU-GPU boundary (AGP/PCIe) the GPU Front End receives the GPU command and data stream, producing a vertex index stream for Primitive Assembly, then Rasterization and Interpolation, Raster Operations, and pixel updates into the Frame Buffer.
Programmable pipeline: the same stages, but with a Programmable Vertex Processor (pre-transformed vertices in, transformed vertices out) and a Programmable Fragment Processor (pre-transformed fragments in, transformed fragments out).
Unified programmable pipeline: a single Unified Vertex, Fragment, and Geometry Processor replaces the separate programmable stages.

General Diagram (6800/NV40)

TurboCache
Uses PCI-Express bandwidth to render directly to system memory, so the card needs less local memory: a performance boost while lowering cost. The TurboCache Manager dynamically allocates from main memory; local memory is used to cache data and to deliver peak performance when needed.

NV40 Vertex Processor
An NV40 vertex processor can execute one vector operation (up to four FP32 components), one scalar FP32 operation, and one texture access per clock cycle.

NV40 Fragment Processors
Early termination from mini z-buffer and z-buffer checks; the resulting sets of 4 pixels (quads) are passed on to the fragment units.

Why the NV40 series was better
Massive parallelism. Scalability: lower-end products have fewer pixel pipes and fewer vertex shader units. Computation power: 222 million transistors; first to comply with Microsoft's DirectX 9 spec; dynamic branching in pixel shaders.

Dynamic Branching
Helps detect whether a pixel needs shading. Instruction flow is handled in groups of pixels; the branch granularity (the number of consecutive pixels that take the same branch) can be specified, allowing better distribution of blocks of pixels between the different quad engines.

General Diagram (7800/G70)

General Diagram (6800/NV40)

GeForce Go 7800 Power Issues
Power consumption and package are the same as the 6800 Ultra chip, meaning notebook designers do not have to change much about their thermal designs. Dynamic clock scaling can run as slow as 16 MHz; this is true for the engine, memory, and pixel clocks. Heavier use of clock gating than the desktop version. Runs at voltages lower than any other mobile performance part. Regardless, you won't get much battery-based runtime for a 3D game.

GeForce 7800 GTX Parallelism
Z-Cull, Triangle Setup/Raster, Shader Instruction Dispatch, 8 Vertex Engines, 24 Pixel Shaders, Fragment Crossbar, 16 Raster Operation Pipelines.

G80 Graphics Mode
The future of GPUs is programmable processing, so build the architecture around the processor. Diagram: Host, Input Assembler, Setup/Rstr/ZCull, vertex/geometry/pixel thread issue units, arrays of streaming processors with L1 caches, L2 caches and frame-buffer partitions, all managed by the thread processor.

G80 CUDA Mode: A Device Example
Processors execute computing threads. A new operating mode and hardware interface for computing: Host, Input Assembler, Thread Execution Manager, parallel data caches, texture units, and load/store paths to global memory.

Why Use the GPU for Computing?
The GPU has evolved into a very flexible and powerful processor: it is programmable using high-level languages, it supports 32-bit floating-point precision, and it offers lots of GFLOPS. GPU in every PC and workstation. (GFLOPS chart legend: G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800.)

What is Behind such an Evolution?
The GPU is specialized for compute-intensive, highly data-parallel computation (exactly what graphics rendering is about), so more transistors can be devoted to data processing rather than data caching and flow control. (Diagram: a CPU devotes much of its die to control and cache alongside DRAM, while a GPU devotes most of its die to ALUs.) The fast-growing video game industry exerts strong economic pressure for constant innovation.

What is (Historical) GPGPU?
General-purpose computation using the GPU and graphics API in applications other than 3D graphics. The GPU accelerates the critical path of the application. Data-parallel algorithms leverage GPU attributes: large data arrays, streaming throughput, fine-grain SIMD parallelism, low-latency floating-point (FP) computation. Applications (see GPGPU.org): game effects (FX) physics, image processing.

Previous GPGPU Constraints
Dealing with the graphics API: working with the corner cases of the graphics API. Addressing modes: limited texture size/dimension. Shader capabilities: limited outputs. Instruction sets: lack of integer and bit ops. Communication limited: between pixels; scatter a[i] = p. (Diagram: a fragment program reads input registers, per-thread/per-shader/per-context textures and constants, and temporary registers, and writes an output register.)

An Example of Physical Reality Behind CUDA
CPU (host) and GPU with local DRAM (device).

Arrays of Parallel Threads
A CUDA kernel is executed by an array of threads. All threads run the same code (SPMD). Each thread has an ID that it uses to compute memory addresses and make control decisions:

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;

Thread Blocks: Scalable Cooperation
Divide the monolithic thread array into multiple blocks. Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization. Threads in different blocks cannot cooperate.
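A minimal runnable sketch of the thread-array pattern above: every thread runs the same code and selects one element by its thread ID. The kernel name, the scaling operation, and the array size guard are illustrative assumptions, not from the slides.

    // Hypothetical kernel illustrating the per-thread pattern from the slide.
    __global__ void scaleKernel(const float* input, float* output, int n)
    {
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
        if (threadID < n) {                                    // guard for a partial last block
            float x = input[threadID];
            float y = 2.0f * x;                                // stands in for func(x)
            output[threadID] = y;
        }
    }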

Batching: Grids and Blocks
A kernel is executed as a grid of thread blocks; all threads share the data memory space. A thread block is a batch of threads that can cooperate with each other by synchronizing their execution (for hazard-free shared memory accesses) and by efficiently sharing data through a low-latency shared memory. Two threads from two different blocks cannot cooperate. (Diagram: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2D array of blocks, each block a 2D array of threads. Courtesy: NVIDIA.)

Thread and Block IDs
Threads and blocks have IDs, so each thread can decide what data to work on. Block ID: 1D or 2D. Thread ID: 1D, 2D, or 3D. This simplifies memory addressing when processing multidimensional data, e.g. image processing or solving PDEs on volumes. (Courtesy: NVIDIA.)

CUDA Memory Space Overview
Each thread can: read/write per-thread registers, read/write per-thread local memory, read/write per-block shared memory, read/write per-grid global memory, read per-grid constant memory, and read per-grid texture memory. The host can read/write global, constant, and texture memory.

Global, Constant, and Texture Memories (Long-Latency Accesses)
Global memory is the main means of communicating read/write data between host and device; its contents are visible to all threads. Texture and constant memories are constants initialized by the host; their contents are visible to all threads. (Courtesy: NVIDIA.)

Thread IDs and Block IDs
Each thread uses its IDs to decide what data to work on. Block ID: 1D or 2D. Thread ID: 1D, 2D, or 3D. This simplifies memory addressing when processing multidimensional data, e.g. image processing or solving PDEs on volumes. (Diagram: Grid 1 and Grid 2 launched by Kernel 1 and Kernel 2; one block expanded into a 4x2x2 array of threads. Courtesy: NVIDIA.)

CUDA Memory Model Overview
Global memory is the main means of communicating read/write data between host and device; its contents are visible to all threads; accesses have long latency. We will focus on global memory for now. (Diagram: host, device grid, per-block shared memory and registers, global memory.)
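For the multidimensional-data case mentioned above, a common pattern is to build a 2D pixel coordinate from the block and thread IDs. This is a sketch under assumed names (the kernel, the inversion operation, and the row-major layout are not from the slides).

    // Hypothetical 2D indexing for image processing: each thread handles one pixel.
    __global__ void invertImage(const unsigned char* in, unsigned char* out,
                                int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column from block ID + thread ID
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row from block ID + thread ID
        if (x < width && y < height) {
            int idx = y * width + x;                     // row-major pixel index
            out[idx] = 255 - in[idx];
        }
    }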

Parallel Computing on a GPU
8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications and are available in laptops, desktops, and clusters (GeForce 8800, Tesla D870, Tesla S870). GPU parallelism is doubling every year. The programming model scales transparently. Programmable in C with CUDA tools.

Single-Program Multiple-Data (SPMD)
A CUDA program is an integrated CPU + GPU application C program: serial C code executes on the CPU, and parallel kernel C code executes on the GPU in thread blocks:

    CPU serial code
    KernelA<<< nBlk, nTid >>>(args);   // Grid 0 runs on the GPU
    CPU serial code
    KernelB<<< nBlk, nTid >>>(args);   // Grid 1 runs on the GPU

Grids and Blocks
A kernel is executed as a grid of thread blocks; all threads share the global memory space. A thread block is a batch of threads that can cooperate with each other by synchronizing their execution using barriers and by efficiently sharing data through a low-latency shared memory. Two threads from two different blocks cannot cooperate. (Diagram courtesy: NVIDIA.)

CUDA Thread Block
The programmer declares the thread block: block size of 1 to 512 concurrent threads; block shape 1D, 2D, or 3D; block dimensions in threads. All threads in a block execute the same thread program. Threads share data and synchronize while doing their share of the work. Threads have thread ID numbers within the block. (Courtesy: John Nickolls, NVIDIA.)

GeForce-8 Series HW Overview
Streaming Processor Array (SPA) made of Texture Processor Clusters (TPC); each TPC contains Streaming Multiprocessors (SM) and a TEX unit; each SM has an instruction L1, a data L1, instruction fetch/dispatch, shared memory, SPs, and SFUs.

CUDA Processor Terminology
SPA: Streaming Processor Array (variable across the GeForce 8-series; 8 TPCs in the GeForce 8800). TPC: Texture Processor Cluster (2 SM + TEX). SM: Streaming Multiprocessor (8 SP); a multi-threaded processor core and the fundamental processing unit for a CUDA thread block. SP: Streaming Processor; a scalar ALU for a single CUDA thread.
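A short sketch of declaring the block shapes described above with dim3 and launching a kernel with an execution configuration. The kernel, names, and sizes are assumptions for illustration only.

    // Hypothetical kernel and launch showing 1D, 2D, and 3D block shapes.
    __global__ void fillKernel(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = 1.0f;
    }

    void launchExamples(float* d_data, int n)
    {
        dim3 block1D(256);                 // 1D block: 256 threads
        dim3 block2D(16, 16);              // 2D block: 16 x 16 = 256 threads
        dim3 block3D(8, 8, 4);             // 3D block: 8 x 8 x 4 = 256 threads (at most 512 per block on G80)
        dim3 grid((n + 255) / 256);        // enough 1D blocks to cover n elements
        fillKernel<<<grid, block1D>>>(d_data, n);   // execution configuration <<<grid, block>>>
    }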

Streaming Multiprocessor (SM)
8 Streaming Processors (SP) and 2 Special Function Units (SFU). Multi-threaded instruction dispatch: 1 to 512 threads active; shared instruction fetch per 32 threads; covers the latency of texture/memory loads. 20+ GFLOPS. 16 KB shared memory. Texture and global memory access. (Diagram: instruction L1, data L1, instruction fetch/dispatch, shared memory, SPs, SFUs.)

G80 Computing Pipeline
The future of GPUs is programmable processing, so build the architecture around the processor. Processors execute computing threads; this is an alternative operating mode specifically for computing. The host feeds an input assembler and a thread execution manager, which drive the SPs with their parallel data caches, texture L1/L2 caches, and load/store paths to global memory. Thread grids are generated based on kernel calls.

Thread Life Cycle in HW
The grid is launched on the SPA. Thread blocks are serially distributed to all the SMs, potentially more than one thread block per SM. Each SM launches warps of threads (two levels of parallelism). The SM schedules and executes warps that are ready to run. As warps and thread blocks complete, resources are freed and more thread blocks can be distributed.

SM Executes Blocks
Threads are assigned to SMs at block granularity: up to 8 blocks per SM, as resources allow. An SM in G80 can take up to 768 threads; that could be 256 threads/block * 3 blocks, or 128 threads/block * 6 blocks, etc. Threads run concurrently. (Diagram: blocks mapped to SM 0 and SM 1, each with an MT issue unit, SPs, shared memory, and a texture L1 backed by L2.)

Thread Scheduling/Execution
Each thread block is divided into 32-thread warps. This is an implementation decision, not part of the CUDA programming model. Warps are the scheduling units in an SM. If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM? Each block is divided into 256/32 = 8 warps, so there are 8 * 3 = 24 warps.

SM Warp Scheduling
The SM multithreaded warp scheduler interleaves warps (e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, warp 3 instruction 96). SM hardware implements zero-overhead warp scheduling: warps whose next instruction has its operands ready for consumption are eligible for execution; eligible warps are selected for execution on a prioritized scheduling policy; all threads in a warp execute the same instruction when selected. 4 clock cycles are needed to dispatch the same instruction for all threads in a warp on G80.

SM Instruction Buffer / Warp Scheduling
Fetch one warp instruction per cycle from the instruction L1 cache into any instruction buffer slot. Issue one ready-to-go warp instruction per cycle from any warp-instruction buffer slot; operand scoreboarding is used to prevent hazards. Issue selection is based on round-robin/age of warp. The SM broadcasts the same instruction to all threads of the warp. (Diagram: I$ L1, multithreaded instruction buffer, register file, C$ L1, shared memory, operand select, MAD units, SFU.)

Scoreboarding
All register operands of all instructions in the instruction buffer are scoreboarded. An instruction becomes ready after the needed values are deposited; this prevents hazards; cleared instructions are eligible for issue. Decoupled memory/processor pipelines: any thread can continue to issue instructions until scoreboarding prevents issue, which allows memory/processor ops to proceed in the shadow of other waiting memory/processor ops.

Granularity Considerations
For matrix multiplication, should I use 4x4, 8x8, 16x16, or 32x32 tiles?
For 4x4, we have 16 threads per block. Since each SM can take up to 768 threads, the thread capacity allows 48 blocks. However, each SM can only take up to 8 blocks, so there will be only 128 threads in each SM! There are 8 warps, but each warp is only half full.
For 8x8, we have 64 threads per block. Since each SM can take up to 768 threads, it could take up to 12 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM. There are 16 warps available for scheduling in each SM; each warp spans four slices in the y dimension.
For 16x16, we have 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule. There are 24 warps available for scheduling in each SM; each warp spans two slices in the y dimension.
(32x32 would give 1024 threads per block, which exceeds the 512-threads-per-block limit on G80, so not even one block fits.)

Memory Hardware in G80

CUDA Memory Space: Review
Each thread can: read/write per-thread registers, read/write per-thread local memory, read/write per-block shared memory, read/write per-grid global memory, read per-grid constant memory, and read per-grid texture memory. The host can read/write global, constant, and texture memory.

Parallel Memory Sharing
Local memory: per thread; private to each thread; holds auto variables and register spill. Shared memory: per block; shared by the threads of the same block; used for inter-thread communication. Global memory: per application; shared by all threads; used for inter-grid communication (grids run sequentially in time).

SM Memory Architecture
Threads in a block share data and results in memory and shared memory, and synchronize at barrier instructions. Per-block shared memory allocation keeps data close to the processor. (Diagram: blocks mapped to SM 0 and SM 1, each with an MT issue unit, SPs, shared memory, and a texture L1 backed by L2. Courtesy: John Nickolls, NVIDIA.)

SM Register File
The register file (RF) is 32 KB (8K entries) for each SM in G80. The TEX pipe can also read/write the RF (2 SMs share 1 TEX). The load/store pipe can also read/write the RF.

Programmer View of Register File
There are 8192 registers in each SM in G80. This is an implementation decision, not part of CUDA. Registers are dynamically partitioned across all blocks assigned to the SM; once assigned to a block, a register is not accessible by threads in other blocks. (Diagram: the register file partitioned across 4 blocks vs. 3 blocks.)

Matrix Multiplication Example
If each block has 16x16 threads and each thread uses 10 registers, how many threads can run on each SM? Each block requires 10*256 = 2560 registers, and 8192 = 3 * 2560 + change, so three blocks can run on an SM as far as registers are concerned. What if each thread increases its register use by 1? Each block now requires 11*256 = 2816 registers, and 8192 < 2816 * 3, so only two blocks fit.

More on Dynamic Partitioning
Dynamic partitioning gives more flexibility to compilers and programmers: one can run a smaller number of threads that require many registers each, or a larger number of threads that require few registers each. This allows for finer-grained threading than traditional CPU threading models. The compiler can trade off between instruction-level parallelism and thread-level parallelism.

Let's program this thing!
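The register arithmetic above, written out as a small commented calculation. The constants are the G80-era figures quoted on the slides (8192 registers per SM, 16x16 threads per block); the program itself is just an illustration.

    #include <stdio.h>

    int main(void)
    {
        const int regs_per_sm       = 8192;
        const int threads_per_block = 16 * 16;                       // 256

        int regs_per_thread = 10;
        int regs_per_block  = regs_per_thread * threads_per_block;   // 2560
        printf("10 regs/thread: %d blocks fit per SM\n",
               regs_per_sm / regs_per_block);                        // prints 3

        regs_per_thread = 11;                                        // one extra register per thread
        regs_per_block  = regs_per_thread * threads_per_block;       // 2816
        printf("11 regs/thread: %d blocks fit per SM\n",
               regs_per_sm / regs_per_block);                        // prints 2
        return 0;
    }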

GPU Computing History
2001/2002: researchers see the GPU as a data-parallel coprocessor; the GPGPU field is born. 2007: NVIDIA releases CUDA (Compute Unified Device Architecture); GPGPU shifts to GPU Computing. 2008: Khronos releases the OpenCL specification.

CUDA Abstractions
A hierarchy of thread groups. Shared memories. Barrier synchronization.

CUDA Terminology
Host: typically the CPU; code written in ANSI C. Device: typically the GPU (data-parallel); code written in extended ANSI C. Host and device have separate memories. A CUDA program contains both host and device code.

CUDA Terminology
Kernel: a data-parallel function. Invoking a kernel creates lightweight threads on the device; threads are generated and scheduled with hardware. Does a kernel remind you of a shader in OpenGL?

CUDA Kernels
A kernel is executed N times in parallel by N different CUDA threads. Key pieces: the declaration specifier, the thread ID, and the execution configuration.

CUDA Program Execution
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/chapter2-cudaprogrammingmodel.pdf

Thread Hierarchies
Grid: one or more thread blocks, organized as a 1D or 2D array of blocks. Block: a 1D, 2D, or 3D array of threads. Each block in a grid has the same number of threads. Each thread in a block can synchronize and access shared memory. (Image from: http://courses.engr.illinois.edu/ece498/al/textbook/chapter2-cudaprogrammingmodel.pdf)

Thread Hierarchies
A block is 1D, 2D, or 3D. Example uses: indexing into a vector, a matrix, or a volume.

Thread Hierarchies
Thread ID: a scalar thread identifier. Thread index: threadIdx.
For a 1D block, the thread ID equals the thread index.
For a 2D block of size (Dx, Dy), the thread ID of index (x, y) is x + y*Dx.
For a 3D block of size (Dx, Dy, Dz), the thread ID of index (x, y, z) is x + y*Dx + z*Dx*Dy.

Thread Hierarchies
(2D index illustration.) A block is a group of threads: up to 512 threads on G80 and GT200, up to 1024 threads on Fermi. The threads of a block reside on the same processor core and share the memory of that core.
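The ID formulas above as a small sketch; the helper-function names are illustrative, while blockDim and threadIdx are the built-in variables the slides refer to.

    // Flattened thread IDs for 2D and 3D blocks, matching the formulas above.
    __device__ int threadId2D()
    {
        // 2D block of size (blockDim.x, blockDim.y): ID = x + y*Dx
        return threadIdx.x + threadIdx.y * blockDim.x;
    }

    __device__ int threadId3D()
    {
        // 3D block of size (blockDim.x, blockDim.y, blockDim.z): ID = x + y*Dx + z*Dx*Dy
        return threadIdx.x
             + threadIdx.y * blockDim.x
             + threadIdx.z * blockDim.x * blockDim.y;
    }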

Thread Hierarchies
A block is a group of threads: up to 512 threads on G80 and GT200, up to 1024 threads on Fermi. The threads of a block reside on the same processor core and share the memory of that core.

Thread Hierarchies
Block index: blockIdx. Block dimension: blockDim. Blocks are indexed in 1D or 2D. (Image from: http://courses.engr.illinois.edu/ece498/al/textbook/chapter2-cudaprogrammingmodel.pdf)

Thread Hierarchies
Example: N = 32, with 16x16 threads per block (independent of N): threadIdx ranges over ([0, 15], [0, 15]). There are 2x2 thread blocks in the grid: blockIdx ranges over ([0, 1], [0, 1]) and blockDim = 16. The 2D index is i = [0, 1] * 16 + [0, 15].

Thread Hierarchies
Thread blocks execute independently, in any order: in parallel or in series. They can be scheduled in any order by any number of cores, which allows code to scale with core count. (Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/cuda_c_programming_guide.pdf)

Thread Hierarchies
Threads in a block share (limited) low-latency memory and synchronize execution to coordinate memory accesses. __syncthreads() is a barrier: threads in the block wait until all threads reach it. It is lightweight. (Images from: http://courses.engr.illinois.edu/ece498/al/textbook/chapter2-cudaprogrammingmodel.pdf)

CUDA Memory Transfers
The host can transfer data to and from device global memory and constant memory. (Image from: http://courses.engr.illinois.edu/ece498/al/textbook/chapter2-cudaprogrammingmodel.pdf)

CUDA Memory Transfers
cudaMalloc(): allocates global memory on the device, returning a pointer to device memory. cudaFree(): frees device memory. (Images from: http://courses.engr.illinois.edu/ece498/al/textbook/chapter2-cudaprogrammingmodel.pdf)

CUDA Memory Transfers
cudaMemcpy(): memory transfer of a given size in bytes. Transfer directions: host to host, host to device, device to host, device to device. Does this remind you of VBOs in OpenGL?

CUDA Memory Transfers
(Diagrams: the host, device global memory, and the direction of each transfer.) Note: plain cudaMemcpy() blocks the host until the copy completes; asynchronous transfers require cudaMemcpyAsync() and streams.
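A minimal allocate/copy/free sketch of the calls above; the function name, array size, and the placeholder kernel step are illustrative assumptions.

    // Hypothetical round trip: allocate device memory, copy in, copy out, free.
    #include <cuda_runtime.h>

    void roundTrip(float* h_data, int n)
    {
        float* d_data = NULL;
        size_t bytes = n * sizeof(float);

        cudaMalloc((void**)&d_data, bytes);                        // allocate device global memory
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice); // host -> device
        // ... launch kernels that read/write d_data here ...
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost); // device -> host
        cudaFree(d_data);                                          // free device memory
    }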

CUDA Memory Transfers
Host to device: the destination is device memory, the source is host memory.

Matrix Multiply
P = M * N. Assume M and N are square for simplicity. Is this data-parallel? (Image from: http://courses.engr.illinois.edu/ece498/al/textbook/chapter2-cudaprogrammingmodel.pdf)

Matrix Multiply
A 1,000 x 1,000 matrix product is 1,000,000 dot products, each of 1,000 multiplies and 1,000 adds.

Matrix Multiply: CPU Implementation

    void MatrixMulOnHost(float* M, float* N, float* P, int width)
    {
        for (int i = 0; i < width; ++i)
            for (int j = 0; j < width; ++j) {
                float sum = 0;
                for (int k = 0; k < width; ++k) {
                    float a = M[i * width + k];
                    float b = N[k * width + j];
                    sum += a * b;
                }
                P[i * width + j] = sum;
            }
    }

Code from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture3%20cuda%20threads%20spring%202010.ppt

Matrix Multiply: CUDA Skeleton
(Three slides of skeleton code, shown as images.)

Matrix Multiply Step 1
Add CUDA memory transfers to the skeleton.

Matrix Multiply: Data Transfer
Allocate the input matrices on the device. Allocate the output matrix on the device.
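A sketch of what this data-transfer step typically looks like; the exact skeleton is on the slide images, the names Md, Nd, Pd follow the later slides, and the rest is assumed.

    // Hypothetical host-side setup: allocate device matrices and copy the inputs over.
    void MatrixMulOnDevice(float* M, float* N, float* P, int width)
    {
        int size = width * width * sizeof(float);
        float *Md, *Nd, *Pd;

        cudaMalloc((void**)&Md, size);                    // allocate input M on the device
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&Nd, size);                    // allocate input N on the device
        cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&Pd, size);                    // allocate output P on the device

        // ... kernel launch goes here (Steps 2 and 3) ...

        cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);  // read the result back
        cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
    }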

Matrix Multiply: Data Transfer
Read the result back from the device.

Matrix Multiply Step 2
Implement the kernel in CUDA C. Does this remind you of GPGPU with GLSL?

Matrix Multiply: CUDA Kernel
We are accessing a matrix, so we use a 2D block. Each kernel thread computes one output element.
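A sketch of such a kernel, consistent with the description above and the CPU loop earlier (one thread per output element); the exact code is on the slide images, so treat this as an approximation.

    // Hypothetical kernel: thread (tx, ty) computes Pd[ty][tx] as one dot product.
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int width)
    {
        int tx = threadIdx.x;   // column of the output element
        int ty = threadIdx.y;   // row of the output element

        float sum = 0;
        for (int k = 0; k < width; ++k) {
            float a = Md[ty * width + k];   // one element of the row of Md
            float b = Nd[k * width + tx];   // one element of the column of Nd
            sum += a * b;
        }
        Pd[ty * width + tx] = sum;          // each thread writes one distinct output element
    }

In this sketch the two outer loops of the CPU version are replaced by the thread indices (tx, ty), and no synchronization is needed because every thread writes a different element of Pd.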

Matrix Multiply: CUDA Kernel
Where did the two outer for loops of the CPU implementation go? There are no locks or synchronization: why?

Matrix Multiply Step 3
Invoke the kernel in CUDA C.

Matrix Multiply: Invoke Kernel
One block with width by width threads.

Matrix Multiply
One block of threads computes the matrix Pd; each thread computes one element of Pd. Each thread loads a row of matrix Md, loads a column of matrix Nd, and performs one multiply and one addition for each pair of Md and Nd elements. The compute to off-chip memory access ratio is close to 1:1 (not very high). The size of the matrix is limited by the number of threads allowed in a thread block. (Diagram: Md, Nd, and Pd of size WIDTH, with Block (2, 2) of Grid 1 highlighted. David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign. Slide from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture2%20cuda%20spring%2009.ppt)

Matrix Multiply
What is the major performance problem with our implementation? What is the major limitation?
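For completeness, a sketch of the Step 3 launch described above (one block of width x width threads), using the same assumed names as the kernel sketch earlier.

    // Hypothetical launch: a single block of width x width threads, one thread per Pd element.
    void launchMatrixMul(float* Md, float* Nd, float* Pd, int width)
    {
        dim3 dimGrid(1, 1);
        dim3 dimBlock(width, width);   // limited by the 512-threads-per-block cap on G80
        MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);
    }

The single-block launch is exactly what the last slide's questions point at: every element requires long-latency global memory traffic, and the matrix size is capped by the per-block thread limit.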