CUDA Memory Model. Monday, 21 February. Some material © David Kirk, NVIDIA, and Wen-mei W. Hwu, 2007-2009 (used with permission).



G80 Implementation of CUDA Memories. Each thread can:
- Read/write per-thread registers
- Read/write per-thread local memory
- Read/write per-block shared memory
- Read/write per-grid global memory
- Read-only per-grid constant memory
[Figure: device memory model - a grid of thread blocks, each with its own shared memory and per-thread registers; the host reads/writes global and constant memory.]

CUDA Variable Type Qualifiers

Variable declaration                             Memory     Scope    Lifetime
__device__ __local__    int LocalVar;            local      thread   thread
__device__ __shared__   int SharedVar;           shared     block    block
__device__              int GlobalVar;           global     grid     application
__device__ __constant__ int ConstantVar;         constant   grid     application

__device__ is optional when used with __local__, __shared__, or __constant__. Automatic variables without any qualifier reside in a register, except arrays, which reside in local memory.
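
As an illustrative sketch (the variable and kernel names below are made up, not part of the slide), these qualifiers combine as follows:

__constant__ float filterCoeffs[16];     // constant memory: grid scope, application lifetime
__device__   float globalAccumulator;    // global memory:   grid scope, application lifetime

__global__ void exampleKernel(float* in)
{
    int idx = threadIdx.x;                    // automatic variable: lives in a register
    float localArray[4];                      // automatic array: placed in local memory
    __shared__ float tileBuffer[256];         // shared memory: visible to the whole block

    tileBuffer[idx] = in[idx] + globalAccumulator + filterCoeffs[0];
    localArray[0]   = tileBuffer[idx];
    in[idx]         = localArray[0];
}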

Where to Declare Variables? Ask: can the host access it?
- Yes (global, constant): declare outside of any function.
- No (shared, local, register/automatic): declare in the kernel.

Variable Type Restrictions. Pointers can only point to memory allocated or declared in global memory, either allocated on the host and passed to the kernel:
__global__ void KernelFunc(float* ptr)
or obtained as the address of a global variable:
float* ptr = &GlobalVar;
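
A short device-side sketch of both legal ways to obtain such a pointer (KernelFunc and GlobalVar as above; the body is illustrative only):

__device__ float GlobalVar;                 // declared in global memory

__global__ void KernelFunc(float* ptr)      // case 1: pointer allocated on the host and passed in
{
    float* p = &GlobalVar;                  // case 2: address of a global variable
    *ptr = *p;                              // both pointers refer to global memory
}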

A Common Programming Strategy. Global memory resides in device memory (DRAM), so it is much slower to access than shared memory. Tile data to take advantage of fast shared memory:
- Partition data into subsets that fit into shared memory.
- Handle each data subset with one thread block by:
  - loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism;
  - performing the computation on the subset from shared memory, so each thread can efficiently make multiple passes over any data element;
  - copying results from shared memory back to global memory.

A Common Programming Strategy (Cont.). Constant memory also resides in device memory (DRAM) and is much slower to access than shared memory, but it is cached! That makes it highly efficient for read-only data. Carefully divide data according to access patterns:
- Read-only: constant memory (very fast if in cache)
- Read/write, shared within a block: shared memory (very fast)
- Read/write within each thread: registers (very fast)
- Read/write inputs/results: global memory (very slow)
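
A minimal sketch of the read-only case, assuming a small coefficient table that every thread reads (the names coeffs, scaleKernel and h_coeffs are illustrative, not from the slides):

__constant__ float coeffs[16];

__global__ void scaleKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= coeffs[i % 16];   // every thread reads the cached constant table
}

// Host side: fill the table before launching the kernel.
// cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(float) * 16);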

Matrix Multiplication using Shared Memory

Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and N
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width + k] * Nd[k*Width + Col];
    Pd[Row*Width + Col] = Pvalue;
}

How about performance on G80? All threads access global memory for their input matrix elements: two memory accesses (8 bytes) per floating-point multiply-add, i.e. 4 B/s of memory bandwidth per FLOPS. Reaching the peak rating of 346.5 GFLOPS would therefore require 4 * 346.5 = 1386 GB/s of memory bandwidth, but the available 86.4 GB/s limits the code to 21.6 GFLOPS (the actual code runs at about 15 GFLOPS). We need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS.

Idea: use shared memory to reuse global memory data. Each input element is read by Width threads. Load each element into shared memory once and have several threads use the local copy to reduce the memory bandwidth. [Figure: WIDTH x WIDTH matrices M, N and P, with thread (tx, ty) computing one element of P.]

Tiled matrix multiplication. Partition A and B into m x m tiles A_{I,K} and B_{K,J}. For p x p matrices, the elementwise product is

    (AB)_{i,j} = \sum_{k=1}^{p} a_{i,k} b_{k,j}

and the same result can be computed tile by tile,

    (AB)_{I,J} = \sum_{K=1}^{p/m} A_{I,K} B_{K,J}

so each output tile is the sum of p/m products of input tiles.
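
For reference, a plain CPU sketch of the tiled formulation (not the CUDA kernel; it assumes Width is a multiple of TILE and row-major storage):

#define TILE 2

void tiledMatMul(const float* A, const float* B, float* C, int Width)
{
    for (int i = 0; i < Width * Width; ++i) C[i] = 0.0f;
    for (int I = 0; I < Width; I += TILE)             // tile row of C
        for (int J = 0; J < Width; J += TILE)         // tile column of C
            for (int K = 0; K < Width; K += TILE)     // sum over the p/m tile products
                for (int i = I; i < I + TILE; ++i)
                    for (int j = J; j < J + TILE; ++j)
                        for (int k = K; k < K + TILE; ++k)
                            C[i*Width + j] += A[i*Width + k] * B[k*Width + j];
}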

Tiled Multiply. Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd. [Figure: Pd divided into TILE_WIDTH x TILE_WIDTH sub-matrices Pd_sub; block (bx, by), thread (tx, ty) computes one element of its Pd_sub from a strip of Md and a strip of Nd.]

[Figure: a small 4x4 example with the elements of Md, Nd and Pd labeled by their (x, y) indices, e.g. Md_{0,0} ... Md_{3,1}, Nd_{0,0} ... Nd_{1,3}, Pd_{0,0} ... Pd_{3,3}.]

Every Md and Nd element is used exactly twice in generating a 2x2 tile of P. In access order:

P_{0,0} (thread 0,0)   P_{1,0} (thread 1,0)   P_{0,1} (thread 0,1)   P_{1,1} (thread 1,1)
M_{0,0} * N_{0,0}      M_{0,0} * N_{1,0}      M_{0,1} * N_{0,0}      M_{0,1} * N_{1,0}
M_{1,0} * N_{0,1}      M_{1,0} * N_{1,1}      M_{1,1} * N_{0,1}      M_{1,1} * N_{1,1}
M_{2,0} * N_{0,2}      M_{2,0} * N_{1,2}      M_{2,1} * N_{0,2}      M_{2,1} * N_{1,2}
M_{3,0} * N_{0,3}      M_{3,0} * N_{1,3}      M_{3,1} * N_{0,3}      M_{3,1} * N_{1,3}

Each phase of a block uses one tile from Md and one from Nd; time runs left to right within a phase.

Phase 1 (steps 1-3):
  T_{0,0}: Md_{0,0} -> Mds_{0,0};  Nd_{0,0} -> Nds_{0,0};  PValue_{0,0} += Mds_{0,0}*Nds_{0,0} + Mds_{1,0}*Nds_{0,1}
  T_{1,0}: Md_{1,0} -> Mds_{1,0};  Nd_{1,0} -> Nds_{1,0};  PValue_{1,0} += Mds_{0,0}*Nds_{1,0} + Mds_{1,0}*Nds_{1,1}
  T_{0,1}: Md_{0,1} -> Mds_{0,1};  Nd_{0,1} -> Nds_{0,1};  PValue_{0,1} += Mds_{0,1}*Nds_{0,0} + Mds_{1,1}*Nds_{0,1}
  T_{1,1}: Md_{1,1} -> Mds_{1,1};  Nd_{1,1} -> Nds_{1,1};  PValue_{1,1} += Mds_{0,1}*Nds_{1,0} + Mds_{1,1}*Nds_{1,1}

Phase 2 (steps 4-6):
  T_{0,0}: Md_{2,0} -> Mds_{0,0};  Nd_{0,2} -> Nds_{0,0};  PValue_{0,0} += Mds_{0,0}*Nds_{0,0} + Mds_{1,0}*Nds_{0,1}
  T_{1,0}: Md_{3,0} -> Mds_{1,0};  Nd_{1,2} -> Nds_{1,0};  PValue_{1,0} += Mds_{0,0}*Nds_{1,0} + Mds_{1,0}*Nds_{1,1}
  T_{0,1}: Md_{2,1} -> Mds_{0,1};  Nd_{0,3} -> Nds_{0,1};  PValue_{0,1} += Mds_{0,1}*Nds_{0,0} + Mds_{1,1}*Nds_{0,1}
  T_{1,1}: Md_{3,1} -> Mds_{1,1};  Nd_{1,3} -> Nds_{1,1};  PValue_{1,1} += Mds_{0,1}*Nds_{1,0} + Mds_{1,1}*Nds_{1,1}

First-order Size Considerations in G80. Each thread block should have many threads: TILE_WIDTH of 16 gives 16*16 = 256 threads. There should be many thread blocks: a 1024*1024 Pd gives 64*64 = 4096 blocks. Each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations, so memory bandwidth is no longer a limiting factor.

CUDA Code - Kernel Execution Configuration

// Set up the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
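
Around this configuration, a host-side launch might look like the following sketch (the allocation/copy calls and the h_M, h_N, h_P host arrays are assumptions added for illustration):

size_t bytes = Width * Width * sizeof(float);
float *Md, *Nd, *Pd;
cudaMalloc((void**)&Md, bytes);
cudaMalloc((void**)&Nd, bytes);
cudaMalloc((void**)&Pd, bytes);
cudaMemcpy(Md, h_M, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(Nd, h_N, bytes, cudaMemcpyHostToDevice);

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

cudaMemcpy(h_P, Pd, bytes, cudaMemcpyDeviceToHost);
cudaFree(Md); cudaFree(Nd); cudaFree(Pd);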

Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;

    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    Pd[Row*Width + Col] = Pvalue;
}

G80 Shared Memory and Threading. Each SM in G80 has 16 KB of shared memory (the shared memory size is implementation dependent!). For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2 KB of shared memory, so up to 8 blocks can potentially be executing at once. This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block). The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8 KB of shared memory usage per thread block, allowing only up to two thread blocks to be active at the same time. Using 16x16 tiling, we reduce accesses to global memory by a factor of 16: the 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!
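
A small host-side check (an added sketch, not from the slide) of how much shared memory a given tile size consumes relative to what the device reports:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int tileWidth = 16;
    size_t perBlock = 2 * tileWidth * tileWidth * sizeof(float);   // Mds + Nds
    printf("Shared memory per block: %zu bytes available\n", prop.sharedMemPerBlock);
    printf("TILE_WIDTH %d uses %zu bytes per block (room for %zu blocks' worth)\n",
           tileWidth, perBlock, prop.sharedMemPerBlock / perBlock);
    return 0;
}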

Block size effect on throughput. [Figure: measured throughput as a function of block size.]

Loop unrolling effects

CUDA Memory Model

Global Memory
- Global memory is not cached, so using the right access pattern is crucial.
- Access to global memory is very costly.
- Variables must be aligned for efficient access (they usually are).
- Simultaneous memory accesses by a half-warp can be coalesced into one transaction.
- Coalescing depends on the device's compute capability.
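
As a rough sketch (the kernel names are made up), the difference between a coalesced and a strided access pattern looks like this:

// Coalesced: consecutive threads read consecutive 32-bit words.
__global__ void copyCoalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads are 'stride' words apart, so a half-warp
// touches many segments and the accesses cannot be coalesced.
__global__ void copyStrided(const float* in, float* out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}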

Coalescing on Devices with Compute Capability 1.0 and 1.1. The global memory access by all threads of a half-warp is coalesced into one or two memory transactions if it satisfies the following three conditions:
- Threads must access
  - either 32-bit words, resulting in one 64-byte memory transaction,
  - or 64-bit words, resulting in one 128-byte memory transaction,
  - or 128-bit words, resulting in two 128-byte memory transactions;
- All 16 words must lie in the same segment of size equal to the memory transaction size (or twice the memory transaction size when accessing 128-bit words);
- Threads must access the words in sequence: the k-th thread in the half-warp must access the k-th word.
If a half-warp does not fulfill all the requirements above, a separate memory transaction is issued for each thread.

[Figure: Left: coalesced float memory access, resulting in a single memory transaction. Right: coalesced float memory access (divergent warp), resulting in a single memory transaction.]

[Figure: Left: non-sequential float memory access, resulting in 16 memory transactions. Right: access with a misaligned starting address, resulting in 16 memory transactions.]

[Figure: Left: non-contiguous float memory access, resulting in 16 memory transactions. Right: non-coalesced float3 memory access, resulting in 16 memory transactions.]
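
One common remedy for the float3 case (an illustrative sketch, not from the slides) is to store the components in separate arrays, so each one is read with a coalesced, stride-1 pattern:

// Structure-of-arrays layout: each array is read contiguously, so the
// accesses of a half-warp fall into adjacent words and coalesce.
__global__ void scalePoints(const float* x, const float* y, const float* z,
                            float* out, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * (x[i] + y[i] + z[i]);   // three coalesced reads, one coalesced write
}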

Coalescing on Devices with Compute Capability 1.2 and Higher. The global memory access by all threads of a half-warp is coalesced into a single memory transaction as soon as the words accessed by all threads lie in the same segment of size equal to:
- 32 bytes if all threads access 8-bit words,
- 64 bytes if all threads access 16-bit words,
- 128 bytes if all threads access 32-bit or 64-bit words.

[Figure: Left: random float memory access within a 64B segment, resulting in one memory transaction. Center: misaligned float memory access, resulting in one transaction (served by a 128B segment). Right: misaligned float memory access, resulting in two transactions (a 32B segment plus a 64B segment).]

Shared Memory
- As fast as register access if there are no bank conflicts.
- Organized as 16 equally-sized modules (banks) on compute capability 1.x.
- n addresses falling in n different banks = one transaction; n addresses in one bank = n transactions!
- A bank conflict can be avoided on read-only access (a word read by all threads is broadcast).
- Successive 32-bit words are assigned to successive banks.
- Each bank has a bandwidth of 32 bits per two clock cycles.

More on bank conflicts. If the warp size is 32 and the number of banks is 16, a shared memory request for a warp is split into one request for the first half of the warp and one request for the second half of the warp. So: there can be no bank conflict between threads belonging to different halves of the same warp!

Typical case:

__shared__ float shared[32];
float data = shared[BaseIndex + s * tid];

tid vs. tid+n: bank conflicts? The half-warp is conflict-free if s*m % 16 != s*n % 16 for any m != n with |m - n| < 16, which holds exactly when the stride s is odd; s = 1 is the usual safe choice.

Beware of data size! char is too small:

__shared__ char shared[32];
char data = shared[BaseIndex + tid];

because shared[0], shared[1], shared[2], and shared[3], for example, belong to the same bank. There are no bank conflicts, however, if the same array is accessed the following way:

char data = shared[BaseIndex + 4 * tid];

There are also 2-way bank conflicts for arrays of double.
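
A common way to sidestep such conflicts in 2D tiles (an added sketch, not from the slides; it assumes a square matrix whose width is a multiple of TILE) is to pad the shared-memory array so that logically adjacent columns fall into different banks:

#define TILE 16

__global__ void transposeTile(const float* in, float* out, int width)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 padding shifts each row by one bank

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;   // transposed block origin
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];   // column-wise read, conflict-free
}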