Nvidia G80 Architecture and CUDA Programming

School of Electrical Engineering and Computer Science

CUDA Programming Model: A Highly Multithreaded Coprocessor

The GPU is viewed as a compute device that:
  Is a coprocessor to the CPU, or host
  Has its own DRAM (device memory)
  Runs many threads in parallel
Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads.
Differences between GPU and CPU threads:
  GPU threads are extremely lightweight, with very little creation overhead
  The GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few
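As a concrete illustration of this model (a minimal sketch, not from the original slides; the kernel, array names, and sizes are hypothetical), a data-parallel kernel assigns one lightweight thread to each output element:

  // Each of n GPU threads handles one element of the output array.
  __global__ void VecAddKernel(const float* A, const float* B, float* C, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
      if (i < n)
          C[i] = A[i] + B[i];
  }

  // Host side: launch enough 256-thread blocks to cover all n elements.
  // VecAddKernel<<<(n + 255) / 256, 256>>>(dA, dB, dC, n);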

Thread Batching: Grids and Blocks

A kernel is executed as a grid of thread blocks; all threads in the grid share the same data memory space.
A thread block is a batch of threads that can cooperate with each other by:
  Synchronizing their execution, for hazard-free shared memory accesses
  Efficiently sharing data through a low-latency shared memory
Two threads from two different blocks cannot cooperate.
[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2D array of blocks and each block is an array of threads. Courtesy: NVIDIA]

Block and Thread IDs

Threads and blocks have IDs, so each thread can decide what data to work on:
  Block ID: 1D or 2D
  Thread ID: 1D, 2D, or 3D
This simplifies memory addressing when processing multidimensional data, e.g. image processing or solving PDEs on volumes.
[Figure: a grid of 2D blocks, each containing a 2D array of threads. Courtesy: NVIDIA]
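For example (a sketch with a hypothetical kernel and image layout, not from the slides), a thread in a 2D grid of 2D blocks combines its block ID and thread ID to find the pixel it owns:

  // Mapping 2D block/thread IDs to a pixel of a width x height image.
  __global__ void InvertImage(float* img, int width, int height)
  {
      int x = blockIdx.x * blockDim.x + threadIdx.x;  // column index
      int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index
      if (x < width && y < height)
          img[y * width + x] = 1.0f - img[y * width + x];
  }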

CUDA Device Memory Space Overview

Each thread can:
  R/W per-thread registers
  R/W per-thread local memory
  R/W per-block shared memory
  R/W per-grid global memory
  Read only per-grid constant memory
  Read only per-grid texture memory
The host can R/W the global, constant, and texture memories.
[Figure: device memory model showing registers and shared memory per block, plus global, constant, and texture memories accessible by the host. Courtesy: NVIDIA]

Global, Constant, and Texture Memories (Long-Latency Accesses)

Global memory: the main means of communicating R/W data between host and device; contents are visible to all threads.
Texture and constant memories: constants are initialized by the host; contents are visible to all threads.
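A sketch of how these spaces appear in CUDA C (variable and kernel names are hypothetical; scopes and access rights follow the list above):

  __constant__ float coeffs[16];       // per-grid constant memory, read-only in kernels

  __global__ void MemorySpacesDemo(float* gOut)     // gOut points into global memory
  {
      int r = threadIdx.x;                          // r lives in a per-thread register
      __shared__ float tile[256];                   // per-block shared memory
      tile[threadIdx.x] = coeffs[threadIdx.x % 16] * r;
      __syncthreads();                              // block-wide synchronization
      gOut[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
  }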

GeForce 8 Series HW Overview

The Streaming Processor Array (SPA) is built from Texture Processor Clusters (TPCs); each TPC contains two Streaming Multiprocessors (SMs) plus a texture unit (TEX), and each SM contains an instruction L1, a data L1, an instruction fetch/dispatch unit, shared memory, 8 SPs, and 2 SFUs.

CUDA Processor Terminology
  SPA - Streaming Processor Array (variable across the GeForce 8 series: 8 in GeForce 8800, 30 in GTX 280)
  TPC - Texture Processor Cluster (2 SMs + TEX)
  SM - Streaming Multiprocessor (8 SPs): a multi-threaded processor core, the fundamental processing unit for a CUDA thread block
  SP - Streaming Processor: a scalar ALU for a single CUDA thread

Streaming Multiprocessor (SM)

8 Streaming Processors (SPs)
2 Super Function Units (SFUs)
Multi-threaded instruction dispatch:
  1 to 768 threads active
  Shared instruction fetch per 32 threads
  Covers the latency of texture/memory loads
20+ GFLOPS
16 KB shared memory
DRAM texture and memory access
[Figure: SM block diagram - instruction L1, data L1, instruction fetch/dispatch, shared memory, 8 SPs, 2 SFUs]

G80 Computing Pipeline

The future of GPUs is programmable processing, so the architecture is built specifically around the processor for computing. Processors execute computing threads, and there is an alternative operating mode specifically for computing: the thread execution manager generates grids based on kernel calls.
[Figure: G80 computing pipeline - host, input assembler, thread execution manager, an array of parallel data caches and texture L1s, load/store units, L2, and global memory (FB). From the lecture notes of ECE 498 AL by D. Kirk and W. Hwu]

Thread Life Cycle in HW

The grid is launched on the SPA.
Thread Blocks are serially distributed to all the SMs (potentially more than one Thread Block per SM).
Each SM launches Warps of Threads: two levels of parallelism.
The SM schedules and executes Warps that are ready to run.
As Warps and Thread Blocks complete, resources are freed and the SPA can distribute more Thread Blocks.
[Figure: grids of blocks launched by the host and mapped onto SMs; each SM has an MT issue unit and shared memory, with a texture L1 shared through the TPC and an L2 below]

SM Executes Blocks

Thread Blocks are assigned to SMs at Block granularity:
  Up to 8 Blocks per SM, as resources allow
  An SM in G80 can take up to 768 threads: e.g., 256 threads/block * 3 blocks, or 128 threads/block * 6 blocks, etc.
Threads run concurrently:
  The SM assigns/maintains thread IDs
  The SM manages/schedules thread execution

Thread Scheduling/Execution

Each Thread Block is divided into 32-thread Warps. This is an implementation decision, not part of the CUDA programming model.
Warps are the scheduling units in an SM and use the SIMD execution model.
If 3 Blocks are assigned to an SM and each Block has 256 threads, how many Warps are there in the SM?
  Each Block is divided into 256/32 = 8 Warps, so there are 8 * 3 = 24 Warps.
  At any point in time, only one of the 24 Warps is selected for instruction fetch and execution.

SM Warp Scheduling

SM hardware implements zero-overhead Warp scheduling:
  Warps whose next instruction has its operands ready for consumption are eligible for execution.
  Eligible Warps are selected for execution on a prioritized scheduling policy.
  All threads in a Warp execute the same instruction when it is selected.
4 clock cycles are needed to dispatch the same instruction for all threads in a Warp on G80.
If one global memory access is needed for every 4 instructions, a minimum of 13 Warps is needed to fully tolerate a 200-cycle memory latency (each Warp covers 4 instructions * 4 cycles = 16 cycles; 200 / 16 = 12.5, rounded up to 13).
[Figure: the SM multithreaded Warp scheduler interleaving, e.g., warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, then warp 8 instruction 12 and warp 3 instruction 96 over time]

A Simple Running Example: Matrix Multiplication

A straightforward matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs:
  Leave shared memory usage until later
  Register usage
  Thread ID usage
  Data transfer API between host and device

An Example: Matrix Multiplication P = M x N

Data structure and simple code in C:

  typedef struct {
      int width;
      int height;
      int pitch;
      float* elements;
  } Matrix;

  void MatrixMulOnHost(const Matrix M, const Matrix N, Matrix P)
  {
      for (int i = 0; i < M.height; ++i)
          for (int j = 0; j < N.width; ++j) {
              double sum = 0;
              for (int k = 0; k < M.width; ++k) {
                  double a = M.elements[i * M.width + k];
                  double b = N.elements[k * N.width + j];
                  sum += a * b;
              }
              P.elements[i * N.width + j] = sum;
          }
  }

Optimizing the CPU code lays a solid foundation for optimizing the GPU code.

Analyzing the Matrix Multiplication (CPU) Code

Number of instructions to be executed.
Number of memory access instructions (i.e., loads) to be executed: 2 * M.height * N.width * M.width
  Each element of M is loaded N.width times; each element of N is loaded M.height times.
The ratio of computation to memory access instructions: for every two loads, one multiply and one add.
For the CPU, cache locality (spatial and temporal) helps reduce the load latencies; for large M and N, temporal locality is low. Optimization? Unroll and jam.

CUDA Device Memory Allocation

cudaMalloc(): allocates an object in device global memory. Requires two parameters:
  Address of a pointer to the allocated object
  Size of the allocated object
cudaFree(): frees an object from device global memory. Takes a pointer to the freed object.
[Figure: device memory model showing where global, constant, and texture memory sit relative to per-block registers and shared memory]

CUDA Device Memory Allocation (cont.)

Code example: allocate a WIDTH * WIDTH (64 * 64) single-precision float array and attach the allocated storage to Md.elements ("d" is often used to indicate a device data structure):

  WIDTH = 64;
  Matrix Md;
  int size = WIDTH * WIDTH * sizeof(float);
  cudaMalloc((void**)&Md.elements, size);
  ...
  cudaFree(Md.elements);

CUDA Host-Device Data Transfer

cudaMemcpy(): memory data transfer. Requires four parameters:
  Pointer to destination
  Pointer to source
  Number of bytes copied
  Type of transfer: host to host, host to device, device to host, or device to device
Asynchronous in CUDA 1.0
[Figure: device memory model showing the host-device transfer paths to global and constant memory]

CUDA Host-Device Data Transfer (cont.)

Code example: transfer a WIDTH * WIDTH (64 * 64) single-precision float array. M is in host memory and Md is in device memory; cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants:

  cudaMemcpy(Md.elements, M.elements, size, cudaMemcpyHostToDevice);
  cudaMemcpy(M.elements, Md.elements, size, cudaMemcpyDeviceToHost);

CUDA Function Declarations

  __device__ float DeviceFunc()   executed on the device, callable only from the device
  __global__ void  KernelFunc()   executed on the device, callable only from the host
  __host__   float HostFunc()     executed on the host, callable only from the host

__global__ defines a kernel function and must return void.
__device__ and __host__ can be used together.

CUDA Function Declarations (cont.)

__device__ functions cannot have their address taken.
For functions executed on the device:
  No recursion
  No static variable declarations inside the function
  No variable number of arguments

Calling a Kernel Function: Thread Creation

A kernel function must be called with an execution configuration:

  __global__ void KernelFunc(...);
  dim3 DimGrid(100, 50);              // 5000 thread blocks
  dim3 DimBlock(4, 8, 8);             // 256 threads per block
  size_t SharedMemBytes = 64;         // 64 bytes of shared memory
  KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking.
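For instance (a minimal sketch, not from the slides; the kernel body and names are hypothetical), a host program that needs the kernel's results before using them can block explicitly:

  __global__ void ScaleKernel(float* data) { data[threadIdx.x] *= 2.0f; }

  void LaunchAndWait(float* dData)
  {
      dim3 DimGrid(1, 1);
      dim3 DimBlock(256, 1, 1);
      ScaleKernel<<<DimGrid, DimBlock>>>(dData);
      // The launch returns immediately; block the host until the kernel finishes.
      cudaThreadSynchronize();   // cudaDeviceSynchronize() in later CUDA releases
      // A blocking cudaMemcpy() back to the host also implicitly waits for the kernel.
  }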

Programming Model: Parallelizing Matrix Multiplication

P = M * N, of size WIDTH x WIDTH.
Without tiling:
  One thread handles one element of P
  M and N are loaded WIDTH times from global memory

Step 1: Matrix Data Transfers

  // Allocate the device memory where we will copy M to
  Matrix Md;
  Md.width = WIDTH;
  Md.height = WIDTH;
  Md.pitch = WIDTH;
  int size = WIDTH * WIDTH * sizeof(float);
  cudaMalloc((void**)&Md.elements, size);

  // Copy M from the host to the device
  cudaMemcpy(Md.elements, M.elements, size, cudaMemcpyHostToDevice);

  // Read M from the device back to the host into P
  cudaMemcpy(P.elements, Md.elements, size, cudaMemcpyDeviceToHost);
  ...
  // Free device memory
  cudaFree(Md.elements);

Step 2: Matrix Multiplication, A Simple Host Code in C

  // Matrix multiplication on the (CPU) host in double precision;
  // for simplicity, we will assume that all dimensions are equal
  void MatrixMulOnHost(const Matrix M, const Matrix N, Matrix P)
  {
      for (int i = 0; i < M.height; ++i)
          for (int j = 0; j < N.width; ++j) {
              double sum = 0;
              for (int k = 0; k < M.width; ++k) {
                  double a = M.elements[i * M.width + k];
                  double b = N.elements[k * N.width + j];
                  sum += a * b;
              }
              P.elements[i * N.width + j] = sum;
          }
  }

Multiply Using One Thread Block

One Block of threads computes matrix P; each thread computes one element of P.
Each thread:
  Loads a row of matrix M
  Loads a column of matrix N
  Performs one multiply and one add for each pair of M and N elements
The compute to off-chip memory access ratio is close to 1:1 (not very high).
The size of the matrix is limited by the number of threads allowed in a thread block.
[Figure: a single block in Grid 1 computing P from a row of M and a column of N]

Step 3: Matrix Multiplication, Host-Side Main Program Code

  int main(void)
  {
      // Allocate and initialize the matrices
      Matrix M = AllocateMatrix(WIDTH, WIDTH, 1);
      Matrix N = AllocateMatrix(WIDTH, WIDTH, 1);
      Matrix P = AllocateMatrix(WIDTH, WIDTH, 0);

      // M * N on the device
      MatrixMulOnDevice(M, N, P);

      // Free matrices
      FreeMatrix(M);
      FreeMatrix(N);
      FreeMatrix(P);
      return 0;
  }

Step 3: Matrix Multiplication, Host-Side Code

  // Matrix multiplication on the device
  void MatrixMulOnDevice(const Matrix M, const Matrix N, Matrix P)
  {
      // Load M and N to the device
      Matrix Md = AllocateDeviceMatrix(M);
      CopyToDeviceMatrix(Md, M);
      Matrix Nd = AllocateDeviceMatrix(N);
      CopyToDeviceMatrix(Nd, N);

      // Allocate P on the device
      Matrix Pd = AllocateDeviceMatrix(P);
      CopyToDeviceMatrix(Pd, P);  // Clear memory

Step 3: Matrix Multiplication, Host-Side Code (cont.)

      // Setup the execution configuration
      dim3 dimBlock(WIDTH, WIDTH);
      dim3 dimGrid(1, 1);

      // Launch the device computation threads!
      MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd);

      // Read P from the device
      CopyFromDeviceMatrix(P, Pd);

      // Free device matrices
      FreeDeviceMatrix(Md);
      FreeDeviceMatrix(Nd);
      FreeDeviceMatrix(Pd);
  }

Step 4: Matrix Multiplication, Device-Side Kernel Function

  // Matrix multiplication kernel - thread specification
  __global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P)
  {
      // 2D Thread ID
      int tx = threadIdx.x;
      int ty = threadIdx.y;

      // Pvalue is used to store the element of the matrix
      // that is computed by the thread
      float Pvalue = 0;

Step 4: Matrix Multiplication, Device-Side Kernel Function (cont.)

      for (int k = 0; k < M.width; ++k) {
          float Melement = M.elements[ty * M.pitch + k];
          float Nelement = N.elements[k * N.pitch + tx];
          Pvalue += Melement * Nelement;
      }

      // Write the matrix to device memory;
      // each thread writes one element
      P.elements[ty * P.pitch + tx] = Pvalue;
  }

Step 5: Some Loose Ends

  // Allocate a device matrix of same size as M.
  Matrix AllocateDeviceMatrix(const Matrix M)
  {
      Matrix Mdevice = M;
      int size = M.width * M.height * sizeof(float);
      cudaMalloc((void**)&Mdevice.elements, size);
      return Mdevice;
  }

  // Free a device matrix.
  void FreeDeviceMatrix(Matrix M)
  {
      cudaFree(M.elements);
  }

  // Free a host matrix.
  void FreeMatrix(Matrix M)
  {
      free(M.elements);
  }
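Assembled into one piece, the simple one-block version looks roughly like the sketch below (raw pointers and a hypothetical fixed WIDTH instead of the Matrix struct and helper functions, so this is not the exact course code):

  #include <cuda_runtime.h>
  #include <stdlib.h>

  #define WIDTH 16   // one block of WIDTH x WIDTH threads computes the whole matrix

  __global__ void SimpleMatrixMulKernel(const float* M, const float* N, float* P)
  {
      int tx = threadIdx.x, ty = threadIdx.y;
      float Pvalue = 0;
      for (int k = 0; k < WIDTH; ++k)
          Pvalue += M[ty * WIDTH + k] * N[k * WIDTH + tx];
      P[ty * WIDTH + tx] = Pvalue;        // each thread writes one element
  }

  int main(void)
  {
      int size = WIDTH * WIDTH * sizeof(float);
      float *M = (float*)malloc(size), *N = (float*)malloc(size), *P = (float*)malloc(size);
      for (int i = 0; i < WIDTH * WIDTH; ++i) { M[i] = 1.0f; N[i] = 2.0f; }

      float *Md, *Nd, *Pd;
      cudaMalloc((void**)&Md, size);
      cudaMalloc((void**)&Nd, size);
      cudaMalloc((void**)&Pd, size);
      cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
      cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

      dim3 dimBlock(WIDTH, WIDTH);
      dim3 dimGrid(1, 1);
      SimpleMatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd);

      cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);   // blocking copy also synchronizes

      cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
      free(M); free(N); free(P);
      return 0;
  }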

Step 5: Some Loose Ends (cont.)

  // Copy a host matrix to a device matrix.
  void CopyToDeviceMatrix(Matrix Mdevice, const Matrix Mhost)
  {
      int size = Mhost.width * Mhost.height * sizeof(float);
      cudaMemcpy(Mdevice.elements, Mhost.elements, size, cudaMemcpyHostToDevice);
  }

  // Copy a device matrix to a host matrix.
  void CopyFromDeviceMatrix(Matrix Mhost, const Matrix Mdevice)
  {
      int size = Mdevice.width * Mdevice.height * sizeof(float);
      cudaMemcpy(Mhost.elements, Mdevice.elements, size, cudaMemcpyDeviceToHost);
  }

Step 6: Handling Arbitrarily Sized Square Matrices

Have each 2D thread block compute a (BLOCK_SIZE)^2 sub-matrix (tile) of the result matrix:
  Each block has (BLOCK_SIZE)^2 threads
  Generate a 2D grid of (WIDTH/BLOCK_SIZE)^2 blocks
You still need to put a loop around the kernel call for cases where WIDTH is greater than the maximum grid size! A sketch of the resulting index arithmetic follows.
[Figure: P partitioned into tiles indexed by (bx, by), with (tx, ty) indexing a thread within a tile]
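A sketch (hypothetical raw-pointer signature, not the course's Matrix-based code) of how each thread in the 2D grid locates its output element:

  #define BLOCK_SIZE 16

  __global__ void MatrixMulGridKernel(const float* M, const float* N, float* P, int Width)
  {
      // Global row and column of the P element this thread owns
      int Row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
      int Col = blockIdx.x * BLOCK_SIZE + threadIdx.x;

      float Pvalue = 0;
      for (int k = 0; k < Width; ++k)
          Pvalue += M[Row * Width + k] * N[k * Width + Col];
      P[Row * Width + Col] = Pvalue;
  }

  // Host-side configuration for a Width x Width result:
  //   dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
  //   dim3 dimGrid(Width / BLOCK_SIZE, Width / BLOCK_SIZE);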

How About Performance?

All threads access global memory for their input matrix elements:
  Two memory accesses (8 bytes) per floating-point multiply-add, i.e. 4 B of memory traffic per FLOP
  86.4 GB/s of memory bandwidth therefore limits the code to about 21.6 GFLOPS (86.4 / 4)
  The actual code runs at about 15 GFLOPS
  We need to drastically cut down memory accesses to get closer to the peak of 346.5 GFLOPS
[Figure: device with N multiprocessors, each with M processors, registers, shared memory, an instruction unit, and constant and texture caches, on top of device memory holding the global, constant, and texture memories]

Developing High-Performance Multithreaded Programs

This can be very complex and application dependent. General guidelines:
  Improve parallelism (thread level): number of thread blocks, number of threads in a block
  Optimize memory usage to achieve high memory bandwidth: memory-level parallelism, coalescing
  Reduce memory accesses
  Improve the instruction throughput
These goals may conflict, e.g. increasing the number of instructions to get higher parallelism. There are also additional hardware constraints due to registers, memory sizes, etc.

Idea #1: Use Shared Memory to Reuse Global Memory Data

Each input element is read by WIDTH threads. If we load each element into shared memory and have several threads use the local version, we can drastically reduce the memory bandwidth: tiled algorithms.

Tiled Multiply Using Thread Blocks

One block computes one square sub-matrix Psub of size BLOCK_SIZE.
One thread computes one element of Psub.
Assume that the dimensions of M and N are multiples of BLOCK_SIZE and square in shape.
[Figure: M, N, and P partitioned into BLOCK_SIZE x BLOCK_SIZE tiles; block (bx, by) computes Psub, and thread (tx, ty) computes one of its elements]

Shared Memory Usage

Each SM has 16 KB of shared memory.
Each Block uses 2 * 256 * 4 B = 2 KB of shared memory, so up to 8 Blocks can potentially be executing actively.
For BLOCK_SIZE = 16, this allows up to 8 * 512 = 4,096 pending loads. (In practice, there will probably be up to half of this because of scheduling to make use of the SPs.)
The next BLOCK_SIZE, 32, would lead to 2 * 32 * 32 * 4 B = 8 KB of shared memory usage per Block, allowing only up to two Blocks to be active at the same time.

First-Order Size Considerations

Each Thread Block should have a minimum of 192 threads; a BLOCK_SIZE of 16 gives 16 * 16 = 256 threads.
There should be a minimum of 32 Thread Blocks; a 1024 * 1024 P matrix gives 64 * 64 = 4,096 Thread Blocks.
Each thread block performs 2 * 256 = 512 float loads from global memory for 256 * (2 * 16) = 8,192 mul/add operations, so memory bandwidth is no longer a limiting factor.

CUDA Code: Kernel Execution Configuration

  // Setup the execution configuration
  dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
  dim3 dimGrid(N.width / dimBlock.x, M.height / dimBlock.y);

For very large N and M dimensions, one would need to add another level of blocking and execute the second-level blocks sequentially.

CUDA Code: Kernel Overview

  // Block index
  int bx = blockIdx.x;
  int by = blockIdx.y;
  // Thread index
  int tx = threadIdx.x;
  int ty = threadIdx.y;

  // Pvalue stores the element of the block sub-matrix
  // that is computed by the thread
  float Pvalue = 0;

  // Loop over all the sub-matrices of M and N
  // required to compute the block sub-matrix
  for (int m = 0; m < M.width/BLOCK_SIZE; ++m) {
      // code from the next few slides
  }

Multiply Using Several Blocks (recap): one block computes one square sub-matrix Psub of size BLOCK_SIZE, and one thread computes one element of Psub; the dimensions of M and N are assumed to be multiples of BLOCK_SIZE.

CUDA Code: Load Data to Shared Memory

  // Get a pointer to the current sub-matrix Msub of M
  Matrix Msub = GetSubMatrix(M, m, by);
  // Get a pointer to the current sub-matrix Nsub of N
  Matrix Nsub = GetSubMatrix(N, bx, m);

  __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
  __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

  // each thread loads one element of the sub-matrix
  Ms[ty][tx] = GetMatrixElement(Msub, tx, ty);
  // each thread loads one element of the sub-matrix
  Ns[ty][tx] = GetMatrixElement(Nsub, tx, ty);

CUDA Code: Compute Result

  // Synchronize to make sure the sub-matrices are loaded
  // before starting the computation
  __syncthreads();

  // each thread computes one element of the block sub-matrix
  for (int k = 0; k < BLOCK_SIZE; ++k)
      Pvalue += Ms[ty][k] * Ns[k][tx];

  // Synchronize to make sure that the preceding
  // computation is done before loading two new
  // sub-matrices of M and N in the next iteration
  __syncthreads();

Shared Memory Bank Conflicts

Threads in the same Warp may have bank conflicts for the Nsub accesses. This should be minimal, since the warp likely spans the horizontal direction, resulting in a broadcast of the Msub accesses and little or no conflict for the N accesses.

CUDA Code: Save Result

  // Get a pointer to the block sub-matrix of P
  Matrix Psub = GetSubMatrix(P, bx, by);

  // Write the block sub-matrix to device memory;
  // each thread writes one element
  SetMatrixElement(Psub, tx, ty, Pvalue);

This code should run at about 45 GFLOPS.
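Putting the pieces together, a self-contained sketch of the tiled kernel using raw pointers and plain indexing instead of the course's GetSubMatrix/GetMatrixElement helpers (names and signature are assumptions):

  #define BLOCK_SIZE 16

  __global__ void TiledMatrixMulKernel(const float* M, const float* N, float* P, int Width)
  {
      __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
      __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

      int bx = blockIdx.x, by = blockIdx.y;
      int tx = threadIdx.x, ty = threadIdx.y;
      int Row = by * BLOCK_SIZE + ty;
      int Col = bx * BLOCK_SIZE + tx;

      float Pvalue = 0;
      // Loop over the tiles of M and N required to compute this P tile
      for (int m = 0; m < Width / BLOCK_SIZE; ++m) {
          // Each thread loads one element of Msub and one element of Nsub
          Ms[ty][tx] = M[Row * Width + (m * BLOCK_SIZE + tx)];
          Ns[ty][tx] = N[(m * BLOCK_SIZE + ty) * Width + Col];
          __syncthreads();                 // wait until the tile is fully loaded

          for (int k = 0; k < BLOCK_SIZE; ++k)
              Pvalue += Ms[ty][k] * Ns[k][tx];
          __syncthreads();                 // wait before overwriting the tile
      }
      P[Row * Width + Col] = Pvalue;       // each thread writes one element
  }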

Idea #2: Use Unroll-and-Jam to Reuse Global Memory Data

Each thread processes more than one element of P; the multiple elements end up reusing the global memory data.
Which loop should be unrolled? Which one does the CPU code favor? Which one does the GPU code favor? Can we take advantage of the cache for constant memory?

Kernel Code for Unroll-and-Jam (with an unroll factor of 2 in the outer loop)

  void MatrixMulOnHost(const Matrix M, const Matrix N, Matrix P)
  {
      for (int i = 0; i < M.height; i += 2)
          for (int j = 0; j < N.width; ++j) {
              double sum1 = 0;
              double sum2 = 0;
              for (int k = 0; k < M.width; ++k) {
                  double a1 = M.elements[i * M.width + k];
                  double b  = N.elements[k * N.width + j];
                  double a2 = M.elements[(i + 1) * M.width + k];
                  sum1 += a1 * b;
                  sum2 += a2 * b;
              }
              P.elements[i * N.width + j] = sum1;
              P.elements[(i + 1) * N.width + j] = sum2;
          }
  }

Kernel Code for Unroll-and-Jam (with an unroll factor of 2 in the inner loop)

  void MatrixMulOnHost(const Matrix M, const Matrix N, Matrix P)
  {
      for (int i = 0; i < M.height; ++i)
          for (int j = 0; j < N.width; j += 2) {
              double sum1 = 0;
              double sum2 = 0;
              for (int k = 0; k < M.width; ++k) {
                  double a  = M.elements[i * M.width + k];
                  double b1 = N.elements[k * N.width + j];
                  double b2 = N.elements[k * N.width + j + 1];
                  sum1 += a * b1;
                  sum2 += a * b2;
              }
              P.elements[i * N.width + j] = sum1;
              P.elements[i * N.width + j + 1] = sum2;
          }
  }

Tradeoff:
  Reduced loads
  Higher register usage (could those values be kept in shared memory instead?)
  Reduced parallelism (i.e., fewer thread blocks)
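On the GPU side, the same idea would have each thread compute two elements of P, halving the reads of N at the cost of more registers per thread; a minimal sketch (hypothetical raw-pointer kernel, not the course code) is:

  __global__ void MatrixMul2PerThreadKernel(const float* M, const float* N, float* P, int Width)
  {
      // Each thread computes two vertically adjacent elements of P:
      // (2*Row, Col) and (2*Row + 1, Col), so each loaded N element is used twice.
      int Row = blockIdx.y * blockDim.y + threadIdx.y;
      int Col = blockIdx.x * blockDim.x + threadIdx.x;

      float sum1 = 0, sum2 = 0;
      for (int k = 0; k < Width; ++k) {
          float b  = N[k * Width + Col];              // loaded once, used twice
          float a1 = M[(2 * Row)     * Width + k];
          float a2 = M[(2 * Row + 1) * Width + k];
          sum1 += a1 * b;
          sum2 += a2 * b;
      }
      P[(2 * Row)     * Width + Col] = sum1;
      P[(2 * Row + 1) * Width + Col] = sum2;
  }

  // Launch with half as many blocks in y, e.g.:
  //   dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
  //   dim3 dimGrid(Width / BLOCK_SIZE, Width / (2 * BLOCK_SIZE));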