CSE 591: GPU Programming. Memories. Klaus Mueller, Computer Science Department, Stony Brook University


Importance of Memory Access Efficiency
- In the inner loop of the matrix multiplication kernel (shown later), every iteration performs two global memory accesses and two floating-point instructions, so the compute-to-global-memory-access ratio (CGMA) is 1.
- The G80 supports 86.4 GB/s of global memory access bandwidth. With 4-byte float accesses, that is 86.4/4 = 21.6 giga operands per second, limiting the kernel to 21.6 GFLOPS, much lower than the peak of 367 GFLOPS.

G80 Implementation of CUDA Memories
Each thread can:
- Read/write per-thread registers
- Read/write per-thread local memory
- Read/write per-block shared memory
- Read/write per-grid global memory
- Read-only per-grid constant memory
(Figure: CUDA memory model diagram showing the host, the grid's global and constant memories, and the per-block shared memory and per-thread registers inside each block.)

CUDA Variable Type Qualifiers

  Variable declaration                           Memory     Scope    Lifetime
  __device__ __local__    int LocalVar;          local      thread   thread
  __device__ __shared__   int SharedVar;         shared     block    block
  __device__              int GlobalVar;         global     grid     application
  __device__ __constant__ int ConstantVar;       constant   grid     application

__device__ is optional when used with __local__, __shared__, or __constant__.
Automatic variables without any qualifier reside in a register, except arrays, which reside in local memory.
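As a minimal sketch of how these qualifiers appear in source code (the variable names are illustrative and the block is assumed to have at most 256 threads):

    // Sketch: where each variable type lives (illustrative names only).
    __constant__ float coeffs[16];     // constant memory: grid scope, application lifetime
    __device__   float globalScale;    // global memory:   grid scope, application lifetime

    __global__ void exampleKernel(float *out)      // out points into global memory
    {
        __shared__ float tileBuf[256];             // shared memory: one copy per block
        int   tx = threadIdx.x;                    // automatic scalar -> register
        float scratch[4];                          // automatic array  -> local memory

        scratch[0] = coeffs[tx % 16] * globalScale;
        tileBuf[tx] = scratch[0];
        __syncthreads();
        out[blockIdx.x * blockDim.x + tx] = tileBuf[tx];
    }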

Variable Memory Types
(Figure: illustration contrasting register variables with shared/global memory variables.)

Where to Declare Variables?
Can the host access it?
- Yes: global, constant -> declare outside of any function
- No: register (automatic), shared, local -> declare in the kernel

Variable Type Restrictions
Pointers can only point to memory allocated or declared in global memory:
- Allocated in the host and passed to the kernel: __global__ void KernelFunc(float* ptr)
- Obtained as the address of a global variable: float* ptr = &GlobalVar;
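A small sketch showing both cases side by side (the kernel body and host calls are illustrative, not part of the original slide):

    // Sketch: the two legal ways for a kernel to obtain a pointer.
    __device__ float GlobalVar;                 // variable declared in global memory

    __global__ void KernelFunc(float *ptr)      // (1) pointer allocated on the host and passed in
    {
        float *p = &GlobalVar;                  // (2) address of a __device__ global variable
        ptr[threadIdx.x] = *p;
    }

    // Host side (assumes N threads in a single block):
    //   float *d_data;  cudaMalloc(&d_data, N * sizeof(float));
    //   KernelFunc<<<1, N>>>(d_data);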

A Common Programming Strategy
Global memory resides in device memory (DRAM), which is much slower to access than shared memory (16 kB per SM on G80). A profitable way of performing computation on the device is therefore to tile the data to take advantage of the fast shared memory:
- Partition the data into subsets that fit into shared memory
- Handle each data subset with one thread block by:
  - Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
  - Performing the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element
  - Copying the results from shared memory back to global memory

A Common Programming Strategy (Cont.)
Constant memory also resides in device memory (DRAM), with much slower access than shared memory, but it is cached: highly efficient access for read-only data.
Carefully divide the data according to its access pattern:
- Read-only -> constant memory (very fast if in cache)
- Read/write, shared within a block -> shared memory (very fast)
- Read/write within each thread -> registers (very fast)
- Read/write inputs/results -> global memory (very slow)
For texture memory usage, see the NVIDIA documentation.
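A minimal sketch of the read-only case (kernel name and sizes are illustrative): a coefficient table placed in constant memory and filled from the host with cudaMemcpyToSymbol.

    // Sketch: read-only data in constant memory.
    #define NUM_COEFFS 64
    __constant__ float d_coeffs[NUM_COEFFS];   // cached; read-only from the kernel's point of view

    __global__ void scaleKernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * d_coeffs[i % NUM_COEFFS];   // every thread reads constant memory
    }

    // Host side:
    //   float h_coeffs[NUM_COEFFS] = { /* ... */ };
    //   cudaMemcpyToSymbol(d_coeffs, h_coeffs, sizeof(h_coeffs));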

GPU Atomic Integer Operations
Atomic operations on integers in global memory:
- Associative operations on signed/unsigned ints: add, sub, min, max, ...
- and, or, xor
- Increment, decrement
- Exchange, compare-and-swap
Requires hardware with compute capability 1.1 and above.
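A minimal sketch of one common use, a global histogram built with atomicAdd (the bin count and kernel name are illustrative):

    // Sketch: per-thread contributions accumulated safely with an atomic read-modify-write.
    #define NUM_BINS 256

    __global__ void histogramKernel(const unsigned char *data, int n, unsigned int *bins)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[data[i]], 1u);   // atomic increment of the bin in global memory
    }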

Matrix Multiplication using Shared Memory

Review: Matrix Multiplication Kernel Using Multiple Blocks

    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Calculate the row index of the Pd element and M
        int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        // Calculate the column index of Pd and N
        int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

        float Pvalue = 0;
        // Each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[Row*Width + k] * Nd[k*Width + Col];

        Pd[Row*Width + Col] = Pvalue;
    }

Matrix Multiplication Using Multiple Blocks
- Break up Pd into tiles
- Each block calculates one tile
- Each thread calculates one element
- Block size equals tile size
(Figure: Md, Nd, and Pd of size WIDTH x WIDTH; block indices bx, by and thread indices tx, ty in the range 0 .. TILE_WIDTH-1 select one TILE_WIDTH x TILE_WIDTH sub-matrix Pdsub.)
(ECE498AL, University of Illinois, Urbana-Champaign)

How about performance on G80?
- All threads access global memory for their input matrix elements
- Two memory accesses (8 bytes) per floating-point multiply-add, i.e. 4 B of memory traffic per FLOP
- 4 * 346.5 = 1386 GB/s would be required to achieve the peak FLOP rating
- The 86.4 GB/s bandwidth limits the code to 21.6 GFLOPS
- The actual code runs at about 15 GFLOPS
- Need to drastically cut down memory accesses to get closer to the peak of 346.5 GFLOPS
(Figure: CUDA memory model diagram, as before.)

Idea: Use Shared Memory to Reuse Global Memory Data
- Each input element is read by Width threads.
- Load each element into shared memory once and have several threads use the local copy to reduce the memory bandwidth demand.
- Tiled algorithms
(Figure: M, N, and P matrices of size WIDTH x WIDTH, with thread indices tx, ty.)

Tiled Multiply
Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd.
(Figure: Md, Nd, and Pd of size WIDTH x WIDTH partitioned into TILE_WIDTH x TILE_WIDTH tiles; block indices bx, by and thread indices tx, ty identify the sub-matrix Pdsub.)

A Small Example
(Figure: the Nd elements Nd0,0 .. Nd1,3, the Md elements Md0,0 .. Md3,1, and the resulting 4x4 Pd with elements Pd0,0 .. Pd3,3.)

Every Md and Nd element is used exactly twice in generating a 2x2 tile of P

  Access     P0,0           P1,0           P0,1           P1,1
  order      (thread 0,0)   (thread 1,0)   (thread 0,1)   (thread 1,1)
    |        M0,0 * N0,0    M0,0 * N1,0    M0,1 * N0,0    M0,1 * N1,0
    |        M1,0 * N0,1    M1,0 * N1,1    M1,1 * N0,1    M1,1 * N1,1
    |        M2,0 * N0,2    M2,0 * N1,2    M2,1 * N0,2    M2,1 * N1,2
    v        M3,0 * N0,3    M3,0 * N1,3    M3,1 * N0,3    M3,1 * N1,3

Breaking Md and Nd into Tiles
(Figure: the same Md, Nd, and Pd elements as above, now partitioned into 2x2 tiles.)

Each phase of a thread block uses one tile from Md and one from Nd
(In each phase, every thread loads one Md element into Mds, loads one Nd element into Nds, and then accumulates its partial product; time runs left to right.)

  Phase 1
    T0,0:  Md0,0 -> Mds0,0   Nd0,0 -> Nds0,0   PValue0,0  += Mds0,0*Nds0,0 + Mds1,0*Nds0,1
    T1,0:  Md1,0 -> Mds1,0   Nd1,0 -> Nds1,0   PValue1,0  += Mds0,0*Nds1,0 + Mds1,0*Nds1,1
    T0,1:  Md0,1 -> Mds0,1   Nd0,1 -> Nds0,1   PdValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1
    T1,1:  Md1,1 -> Mds1,1   Nd1,1 -> Nds1,1   PdValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1

  Phase 2
    T0,0:  Md2,0 -> Mds0,0   Nd0,2 -> Nds0,0   PValue0,0  += Mds0,0*Nds0,0 + Mds1,0*Nds0,1
    T1,0:  Md3,0 -> Mds1,0   Nd1,2 -> Nds1,0   PValue1,0  += Mds0,0*Nds1,0 + Mds1,0*Nds1,1
    T0,1:  Md2,1 -> Mds0,1   Nd0,3 -> Nds0,1   PdValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1
    T1,1:  Md3,1 -> Mds1,1   Nd1,3 -> Nds1,1   PdValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1

First-order Size Considerations in G80
- Each thread block should have many threads: TILE_WIDTH = 16 gives 16*16 = 256 threads
- There should be many thread blocks: a 1024*1024 Pd gives 64*64 = 4096 thread blocks
- Each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8192 mul/add operations
- Memory bandwidth is no longer a limiting factor

Locality
This scheme enforces locality: the computation focuses on one subset of the data elements at a time, which allows a small but high-speed memory to be used for fast computation. Exploiting locality in this way matches fast processors with high effective memory bandwidth and so maximizes performance. Locality is useful in any multi-core configuration.

CUDA Code: Kernel Execution Configuration

    // Set up the execution configuration
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
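The corresponding kernel launch would then look as follows (a sketch, assuming Width is a multiple of TILE_WIDTH and Md, Nd, Pd are device pointers already allocated and filled):

    // Sketch: launching the tiled kernel with the configuration above.
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);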

Tiled Matrix Multiplication Kernel

    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

        int bx = blockIdx.x;  int by = blockIdx.y;
        int tx = threadIdx.x; int ty = threadIdx.y;

        // Identify the row and column of the Pd element to work on
        int Row = by * TILE_WIDTH + ty;
        int Col = bx * TILE_WIDTH + tx;

        float Pvalue = 0;
        // Loop over the Md and Nd tiles required to compute the Pd element
        for (int m = 0; m < Width/TILE_WIDTH; ++m) {
            // Collaborative loading of Md and Nd tiles into shared memory
            Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
            Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
            __syncthreads();

            for (int k = 0; k < TILE_WIDTH; ++k)
                Pvalue += Mds[ty][k] * Nds[k][tx];
            __syncthreads();
        }
        Pd[Row*Width + Col] = Pvalue;
    }

Tiled Multiply
- Each block computes one square sub-matrix Pdsub of size TILE_WIDTH
- Each thread computes one element of Pdsub
(Figure: as before, with the loop indices m and k marking the current tile of Md and Nd being traversed.)

View: G80 Registers
- Each SM has 8k (8192) registers (128k total across the chip)
- Each SM can have up to 768 threads, so each thread can use up to 8192/768 = 10 registers
- Now, if each thread used 11 registers, the number of executable threads is reduced; the reduction happens at the block level:
  - with 256 threads/block, 768/256 = 3 blocks fit per SM
  - losing 1 block leaves 2 blocks, i.e. 512 threads
  - this reduces the number of warps by 1/3 and so reduces the ability to hide latency
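As an aside not on the original slide: later CUDA releases expose this tradeoff directly, either through the __launch_bounds__ qualifier or the nvcc flag --maxrregcount. A minimal sketch, with illustrative values and an illustrative kernel:

    // Sketch: ask the compiler to keep register usage compatible with 256-thread
    // blocks and at least 3 resident blocks per SM (values are illustrative).
    __global__ void __launch_bounds__(256, 3) boundedKernel(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // assumes the grid exactly covers the array
        data[i] = data[i] * 2.0f;
    }
    // Alternatively, limit registers per thread at compile time:
    //   nvcc --maxrregcount=10 kernel.cu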

View: G80 Shared Memory
- G80 has 16 kB of shared memory per SM
- Each SM can have up to 8 blocks, so for full block occupancy each block can use at most 16/8 = 2 kB
- If each block used 5 kB, only 3 blocks could be assigned to each SM
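Shared memory per block can also be sized at launch time through the third execution-configuration parameter; a minimal sketch (the kernel and sizes are illustrative, and n must not exceed the per-block thread limit):

    // Sketch: dynamically sized shared memory, specified at launch time.
    __global__ void reverseKernel(float *data, int n)
    {
        extern __shared__ float buf[];          // size set by the launch configuration
        int t = threadIdx.x;
        buf[t] = data[t];
        __syncthreads();
        data[t] = buf[n - 1 - t];
    }

    // Host side: one block of n threads, n*sizeof(float) bytes of shared memory per block
    //   reverseKernel<<<1, n, n * sizeof(float)>>>(d_data, n);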

View: G80 Matrix Multiplication Example
- Each SM in G80 has 16 KB of shared memory (the SM size is implementation dependent!)
- For TILE_WIDTH = 16, each thread block uses 2*256*4 B = 2 KB of shared memory, so up to 8 thread blocks can potentially be actively executing
- This allows up to 8*512 = 4096 pending loads (2 per thread, 256 threads per block)
- The next TILE_WIDTH, 32, would lead to 2*32*32*4 B = 8 KB of shared memory per thread block, allowing only up to two thread blocks to be active at the same time
- Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16
- The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!

Tiling Size Effects
(Figure: bar chart of GFLOPS, 0 to 100, for the untiled kernel and for 4x4, 8x8, 12x12, and 16x16 tiles; each tiled configuration is shown both tiled only and tiled & unrolled. Unrolling is discussed later.)

Summary: Typical Structure of a CUDA Program
- Global variable declarations
  - __host__, __device__ ... __global__, __constant__, texture
- Function prototypes
  - __global__ void kernelOne( ... )
  - float handyFunction( ... )
- main()
  - allocate memory space on the device: cudaMalloc(&d_GlblVarPtr, bytes ... )
  - transfer data from host to device: cudaMemcpy(d_GlblVarPtr, h_Gl ... )
  - execution configuration setup
  - kernel call: kernelOne<<<execution configuration>>>( args ... );
  - transfer results from device to host: cudaMemcpy(h_GlblVarPtr, ... )
  - optional: compare against a golden (host-computed) solution
- Kernel: void kernelOne(type args, ... )
  - variable declarations: __local__, __shared__
  - automatic variables transparently assigned to registers or local memory
  - __syncthreads()
- Other functions: float handyFunction(int inVar ... );
(repeat as needed)
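To make the skeleton above concrete, here is a minimal end-to-end sketch that follows the same structure; an element-wise scale kernel stands in for the real computation, names and sizes are illustrative, and error checking is omitted:

    // Sketch: typical CUDA program structure (illustrative example).
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void kernelOne(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // automatic variables -> registers
        if (i < n)
            data[i] *= factor;
    }

    int main()
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *h_data = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

        float *d_data;
        cudaMalloc(&d_data, bytes);                                  // allocate device memory
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // host -> device

        dim3 block(256);                                             // execution configuration
        dim3 grid((n + block.x - 1) / block.x);
        kernelOne<<<grid, block>>>(d_data, 2.0f, n);                 // kernel call

        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // device -> host

        printf("h_data[0] = %f (expected 2.0)\n", h_data[0]);        // check against expectation

        cudaFree(d_data);
        free(h_data);
        return 0;
    }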