Lecture 7. Overlap; Using Shared Memory; Performance Programming the Memory Hierarchy

Announcements
- Mac Mini lab (APM 2402) starts Tuesday; meets every Tuesday and Friday for the next 2 weeks
- Project proposals due on November 9 (revised date)

Projects
- Counts for 60% of your grade; complete in 3 weeks
- See the (growing) list of projects at cseweb.ucsd.edu/classes/fa12/cse260-b/projects/projectlist.html
- CUDA, MPI, or CUDA + MPI
- Select your project from the list by 11/9
- A limited number of self-proposed projects; requires a proposal
- Progress report: 11/21 (Weds)
- Final report: 12/7 (Friday)

Projects!
- Stencil method in 3 dimensions
- Multigrid
- Communication-avoiding matrix multiplication (MPI)
- Algorithm-based fault tolerance (MPI)
- 3D Fast Fourier Transform (MPI or CUDA)
- Particle simulation (MPI)
- Groups of 3 will do a more ambitious project
  - MPI projects can add communication overlap
  - MPI + CUDA
- Propose your own
- Make your choice by 11/9: www-cse.ucsd.edu/classes/fa12/cse260-b/projects/projectlist.html

Today's lecture
- Overlap
- Occupancy
- Improving matrix multiplication performance with shared memory
- Memory coalescing
- Avoiding bank conflicts

Fermi platforms in the class
- CSEClass 01-02: GeForce GTX 580 [compute capability 2.0, GF100]
  - 15 vector units @ 32 cores/unit (480 cores), 4 SFUs
  - 1.25 GB device memory
- CSEClass 03-07: GeForce GTX 460 [2.1, GF104]
  - 7 vector units @ 48 cores/unit (336 cores), 8 SFUs
  - 1.0 GB device memory
- Dirac: Tesla C2050 [2.0, GF100], 1 device per node
  - 14 vector units @ 32 cores/unit (448 cores), 4 SFUs
  - 3 GB device memory + ECC (2.625 GB usable)
  - SP MAD: 1030.4 Gflops, DP FMA: 515.2 Gflops
- www.anandtech.com/show/3809/nvidias-geforce-gtx-460-the-200-king/2 (NVIDIA)
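These figures can be confirmed at run time with a device query; a minimal sketch (not part of the lecture slides) using the CUDA runtime API:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main(void) {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int d = 0; d < count; d++) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, d);
            // Report the properties discussed above: capability, SM count, device memory
            printf("%s: compute capability %d.%d, %d SMs, %.2f GB device memory\n",
                   p.name, p.major, p.minor, p.multiProcessorCount,
                   p.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        }
        return 0;
    }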

Warp Scheduling (Fermi)
- Threads are assigned to an SM in units of a thread block; multiple blocks per SM
- Each block is divided into warps of 32 (SIMD) threads, the schedulable unit
- A warp becomes eligible for execution when all its operands are available
- Dynamic instruction reordering: eligible warps are selected for execution using a prioritized scheduling policy
- All threads in a warp execute the same instruction; branches serialize execution
- Multiple warps are simultaneously active, hiding data transfer delays
- All registers in all the warps are available, so scheduling has zero overhead
- Hardware is free to assign blocks to any SM
- How does scheduling work?
[Figure: SM timeline showing the scheduler interleaving instructions from different warps, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, warp 3 instruction 96]

Instruction Issue on Fermi
- Each vector unit has
  - 32 CUDA cores for integer and floating-point arithmetic
  - 4 special function units for single-precision transcendentals
  - 2 warp schedulers; each scheduler issues 1 (2) instructions per cycle for capability 2.0 (2.1)
- One scheduler is in charge of odd warp IDs, the other of even warp IDs
- Only 1 scheduler can issue a double-precision floating-point instruction at a time
- A warp scheduler can issue an instruction to half the CUDA cores in an SM
- The scheduler must issue an integer or floating-point arithmetic instruction over 2 clock cycles

Dynamic behavior: resource utilization
- Each vector core (SM): 1024 thread slots and 8 block slots
- Hardware partitions slots into blocks at run time, accommodating different processing capacities
- Mac mini 320M: 6 SMs
- Registers are split dynamically across all blocks assigned to the vector core
- A register is private to a single thread within a single block
Courtesy David Kirk/NVIDIA and Wen-mei Hwu/UIUC

Today's lecture
- Overlap
- Occupancy
- Improving matrix multiplication performance with shared memory
- Memory coalescing
- Avoiding bank conflicts

Occupancy
- A minimum number of warps is needed to hide memory latency
- Occupancy = # active warps / max # warps supported by the vector unit
- Limited by vector unit resources
  - Amount of shared memory
  - Number of registers
  - Maximum number of threads
- Consider a kernel with a 16 x 16 block size
  - Shared memory/block = 2648 bytes
  - Registers/thread = 38 [38 * 256 = 9728 < 16K]
  - The number of available registers is the limiting factor
- Tradeoff: more blocks with fewer threads, or more threads with fewer blocks
  - Locality: we want small blocks of data (and hence more plentiful warps) that fit into fast memory
  - Register consumption
- Maximizing occupancy may not maximize performance
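A minimal sketch of that limit calculation (not from the slides; it assumes the GT200-class per-SM limits quoted earlier in the lecture: 16K 32-bit registers, 16 KB shared memory, 1024 thread slots, 8 block slots):

    #include <stdio.h>

    int main(void) {
        const int threadsPerBlock = 16 * 16;   // 256
        const int regsPerThread   = 38;
        const int smemPerBlock    = 2648;      // bytes

        const int regsPerSM    = 16 * 1024;    // assumed register file size
        const int smemPerSM    = 16 * 1024;    // assumed shared memory per SM (bytes)
        const int threadsPerSM = 1024, blocksPerSM = 8;

        int byThreads = threadsPerSM / threadsPerBlock;                // 4 blocks
        int byRegs    = regsPerSM / (regsPerThread * threadsPerBlock); // 16384 / 9728 = 1 block
        int bySmem    = smemPerSM / smemPerBlock;                      // 6 blocks

        int resident = byThreads;
        if (byRegs < resident)      resident = byRegs;
        if (bySmem < resident)      resident = bySmem;
        if (blocksPerSM < resident) resident = blocksPerSM;

        // Registers are the binding limit: 1 resident block of 8 warps out of a possible 32
        printf("resident blocks/SM = %d, occupancy = %.0f%%\n",
               resident, 100.0 * resident * (threadsPerBlock / 32) / (threadsPerSM / 32));
        return 0;
    }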

Occupancy Calculator
http://developer.download.nvidia.com/compute/cuda/cuda_occupancy_calculator.xls

Determining occupancy
- Recall the definition of occupancy: # active warps / max # warps supported by the vector unit
- NVIDIA provides an occupancy calculator
- Determine resource usage from nvcc:

    nvcc --ptxas-options=-v
    Used 10 registers, 2092+16 bytes smem
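The same numbers can also be queried at run time; a hedged sketch (not from the slides) using the CUDA runtime's cudaFuncGetAttributes, applied here to a placeholder kernel standing in for the lecture's matmul:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel; in practice you would query the real matmul kernel.
    __global__ void dummyKernel(float *x) { if (x) x[threadIdx.x] = 0.0f; }

    int main(void) {
        cudaFuncAttributes attr;
        cudaError_t err = cudaFuncGetAttributes(&attr, dummyKernel);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("registers/thread: %d, static smem/block: %zu bytes, max threads/block: %d\n",
               attr.numRegs, attr.sharedSizeBytes, attr.maxThreadsPerBlock);
        return 0;
    }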

Occupancy calculation with 16 x 16 threads (CUDA GPU Occupancy Calculator)
- Inputs: compute capability 2.0; 256 threads per block; 10 registers per thread; 2092 bytes of shared memory per block
- Results: 1536 active threads per multiprocessor, 48 active warps, 6 active thread blocks, 100% occupancy
- Physical limits for compute capability 2.0: 32 threads/warp, 48 warps/SM, 1536 threads/SM, 8 blocks/SM, 32768 32-bit registers/SM, 49152 bytes of shared memory/SM
- Allocation per thread block: 8 warps, 2560 registers, 2176 bytes of shared memory
- Maximum thread blocks per multiprocessor: 6 limited by warps per multiprocessor (the binding limit), 12 limited by registers, 22 limited by shared memory

Full occupancy
[Occupancy calculator graphs: multiprocessor warp occupancy vs. block size, vs. registers per thread (current count: 10), and vs. shared memory per block (current usage: 2092 bytes); all three curves reach the maximum of 48 warps]

Occupancy with 8 x 8 thread blocks
- Resource usage: 64 threads per block, 10 registers per thread, 2092 bytes of shared memory per block
- Maximum thread blocks per multiprocessor: 8 limited by max warps/blocks per multiprocessor (the binding limit, highlighted red in the calculator), 102 limited by registers, 22 limited by shared memory
[Occupancy calculator graphs: multiprocessor warp occupancy vs. block size, registers per thread, and shared memory per block]

Summary - programming issues
- Branches serialize execution within a warp
- Tradeoff: more blocks with fewer threads, or more threads with fewer blocks
  - Locality: we want small blocks of data (and hence more plentiful warps) that fit into fast memory
  - Register consumption
- Scheduling: hide latency
- Shared memory and registers do not persist across kernel invocations
- Next time: using shared memory

Today's lecture
- Overlap
- Occupancy
- Improving matrix multiplication performance with shared memory
- Memory coalescing
- Avoiding bank conflicts

Speeding up matrix multiply
- Use shared memory to increase re-use
- Avoid thread divergence
- Memory coalescing; avoid bank conflicts

Blocked Matrix Multiplication

    // The n x n matrices are partitioned into N x N grids of b x b blocks, b = n/N
    for i = 0 to N-1
      for j = 0 to N-1
        // load block C[i,j] into cache, once : n^2 words in total
        for k = 0 to N-1
          // load blocks A[i,k] and B[k,j] : N^3 block loads = 2*N^3*(n/N)^2 = 2*N*n^2 words
          C[i,j] += A[i,k] * B[k,j]   // block matrix multiply
        // write block C[i,j] once : n^2 words in total

Total traffic: (2*N + 2) * n^2 words
[Figure: block update C[i,j] = C[i,j] + A[i,k] * B[k,j]]
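A minimal host-side sketch of this loop nest in C (not from the slides; n is the matrix dimension and b a block size assumed to divide it evenly):

    // Blocked (tiled) matrix multiply on the host: C += A * B, all n x n, row-major.
    void matmul_blocked(int n, int b, const double *A, const double *B, double *C) {
        for (int i0 = 0; i0 < n; i0 += b)
            for (int j0 = 0; j0 < n; j0 += b)
                for (int k0 = 0; k0 < n; k0 += b)
                    // Multiply the b x b blocks starting at (i0,k0) of A and (k0,j0) of B
                    for (int i = i0; i < i0 + b; i++)
                        for (int k = k0; k < k0 + b; k++) {
                            double a = A[i * n + k];
                            for (int j = j0; j < j0 + b; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }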

Naïve kernel implementation
- Each thread computes one element of C
  - Loads a row of matrix A
  - Loads a column of matrix B
  - Computes a dot product
- Every value of A and B is loaded N times from global memory
[Figure: thread (2,2) of a thread block forming one element of C from a row of A and a column of B]
Courtesy David Kirk/NVIDIA and Wen-mei Hwu/UIUC

Naïve Kernel

    __global__ void matmul(DOUBLE* C, DOUBLE* A, DOUBLE* B) {
        int I = blockIdx.x*blockDim.x + threadIdx.x;
        int J = blockIdx.y*blockDim.y + threadIdx.y;
        int N = blockDim.y*gridDim.y;           // Assume a square matrix
        if ((I < N) && (J < N)) {
            DOUBLE _c = 0;
            for (unsigned int k = 0; k < N; k++) {
                DOUBLE a = A[I * N + k];
                DOUBLE b = B[k * N + J];
                _c += a * b;
            }
            C[I * N + J] = _c;
        }
    }

For reference, the serial host code:

    for (unsigned int i = 0; i < N; i++)
        for (unsigned int j = 0; j < N; j++) {
            DOUBLE sum = 0;
            for (unsigned int k = 0; k < N; k++)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = (DOUBLE) sum;
        }
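A hedged host-side driver for this kernel (not from the slides; DOUBLE is assumed to be a typedef for double, the 16 x 16 block shape is an assumption, and N must be a multiple of the block size so the kernel can recover N from blockDim * gridDim):

    #include <cuda_runtime.h>

    typedef double DOUBLE;                                      // assumed
    __global__ void matmul(DOUBLE* C, DOUBLE* A, DOUBLE* B);    // the kernel above

    void run_naive_matmul(int N, const DOUBLE *hA, const DOUBLE *hB, DOUBLE *hC) {
        size_t bytes = (size_t)N * N * sizeof(DOUBLE);
        DOUBLE *dA, *dB, *dC;
        cudaMalloc(&dA, bytes);  cudaMalloc(&dB, bytes);  cudaMalloc(&dC, bytes);
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

        const int BLOCK = 16;                    // 16 x 16 = 256 threads per block
        dim3 threads(BLOCK, BLOCK);
        dim3 grid(N / BLOCK, N / BLOCK);         // one thread per element of C
        matmul<<<grid, threads>>>(dC, dA, dB);
        cudaDeviceSynchronize();                 // wait before copying back or timing

        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
        cudaFree(dA);  cudaFree(dB);  cudaFree(dC);
    }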

Shared Memory / Cache
- On-chip local store per SM: shared memory + L1 cache
  - 16 KB shared memory + 48 KB L1 cache, or
  - 48 KB shared memory + 16 KB L1 cache
- All threads in a block share this on-chip memory; a collection of warps shares a portion of the local store
- Caches local or global memory, including temporary register spills
- L2 cache is shared by all vector units
- Whether global loads are cached in L1 as well as L2, or in L2 only, is configurable at compile time
- 128-byte cache line size
- Set the preference per kernel (the initial split is 48 KB shared, 16 KB L1):

    cudaFuncSetCacheConfig(boundariesX, preference);
    // preference = cudaFuncCachePreferShared or cudaFuncCachePreferL1
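A hedged sketch of setting this preference from host code (the kernel name here is a stand-in; the call in the lecture's own code uses boundariesX):

    #include <cuda_runtime.h>

    __global__ void tiledKernel(float *out, const float *in, int n);   // defined elsewhere

    void configureCache(void) {
        // Favor the 48 KB shared / 16 KB L1 split for a kernel that tiles through shared memory.
        cudaFuncSetCacheConfig(tiledKernel, cudaFuncCachePreferShared);

        // For a kernel that uses little shared memory, prefer the larger L1 instead:
        // cudaFuncSetCacheConfig(tiledKernel, cudaFuncCachePreferL1);
    }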

Improving locality in matrix multiply
- Naïve algorithm
  - Each thread loads all the data it needs independently: a row and a column of input
  - Each input element is loaded multiple times
  - Each thread: 1 multiply-add + 2 loads + 1 store
- Blocked algorithm with shared memory
  - Threads cooperate to load a block of A and B into on-chip shared memory
  - Each thread in the block performs the ijk loop within shared memory
  - Each thread: b multiply-adds + 1 load + 1 store
[Figure: thread (2,2) of a thread block and the BLOCK_SIZE x BLOCK_SIZE tiles it works on]

Using shared memory (uncoalesced global accesses)

    __global__ void matmul(float* C, float* A, float* B, int N) {
        const unsigned int bx = BLOCK_X, by = BLOCK_Y;
        const unsigned int tx = threadIdx.x, ty = threadIdx.y;
        const unsigned int I = blockIdx.x*bx + tx, J = blockIdx.y*by + ty;
        const unsigned int gx = gridDim.x, gy = gridDim.y;
        __shared__ float a[BLOCK_X][BLOCK_Y], b[BLOCK_X][BLOCK_Y];
        if ((I < N) && (J < N)) {
            float c = 0.0f;
            for (unsigned int k = 0; k < gy; k++) {
                a[tx][ty] = A[I*N + k*by + ty];
                b[ty][tx] = B[J + N*(k*bx + tx)];
                __syncthreads();                    // Synchronizes all threads in a block
                for (unsigned int kk = 0; kk < bx; kk++)
                    c += a[kk][tx] * b[kk][ty];
                __syncthreads();                    // Avoids memory hazards
            }
            C[I*N + J] = c;
        }
    }

Results with shared memory (N = 512, double precision)
- Lilliput: C1060; host is a 2.0 GHz Intel Xeon E5504, 4 MB L3, peak 8.0 GF/core
- Dirac: C2050
- Baseline: 23 Gflops on 4 cores of Lilliput; 69 Gflops on 8 cores of Triton (double)

Shared memory variant, Gflops (double precision), Lilliput with Dirac in parentheses:

    Geometry      16 x 16     8 x 8    4 x 4
    Uncoalesced   9.2 (38)    8.9      8.2
    Coalesced     57  (93)    41       15

Global memory variant, Gflops (double precision):

    Lilliput   Dirac (prefer SM)   Dirac (prefer L1$)   Geometry (Lilliput)   Geometry (Dirac)
    9.8        57                  62                   2 x 256               1 x 512
    8.5        53                  56                   2 x 128               2 x 256, 2 x 512
    7.4        48                  58                   2 x 64                2 x 128

Today's lecture
- Overlap
- Occupancy
- Improving matrix multiplication performance with shared memory
- Memory coalescing
- Avoiding bank conflicts

Memory interleaving
- Compensates for slow memory access times
- Assume we are accessing memory consecutively
- What happens if the stride equals the number of banks?
[Figure: consecutive addresses 0-19 interleaved across memory banks]
Ruye Wang, fourier.eng.hmc.edu/e85/lectures/memory
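A minimal sketch of the interleaved mapping (the 4-byte word size and bank count are assumptions; the point is that a stride equal to the number of banks sends every access to the same bank):

    // Bank that a word-aligned address maps to under simple interleaving.
    int bank_of(unsigned byteAddress, int numBanks) {
        return (byteAddress / 4) % numBanks;    // 4-byte words assumed
    }

    // With 16 banks, addresses 0, 64, 128, ... (a stride of 16 words) all map to bank 0,
    // so the accesses serialize instead of proceeding in parallel.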

Global memory coalescing
- Global memory is accessed in units of 32, 64, or 128 bytes
- Consecutive addresses are read quickly (K = 0, Q = 1 below)
- Certain non-sequential access patterns degrade performance (K mod 16 != 0, or Q > 1)
- Accesses organized by half warps (16 threads) can be serviced in one or two transactions, under certain conditions (32-, 64- and 128-byte segments)

    tid = blockIdx.x*blockDim.x + threadIdx.x + K;
    shared[tid] = global[tid];

    int tid = (blockIdx.x*blockDim.x + threadIdx.x)*Q;

[Figure: half-warp accesses falling into 32-, 64- and 128-byte segments]
Nvidia Corp.
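A hedged sketch contrasting the two access patterns above (the kernel names and the stride parameter are illustrative, not from the slides):

    // Coalesced: thread t of a half warp touches element base + t, so the 16 accesses
    // fall into one or two aligned segments.
    __global__ void copy_coalesced(float *out, const float *in, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) out[tid] = in[tid];
    }

    // Strided: with Q > 1 the half warp's addresses spread over Q times as many
    // segments, so more (and wider) memory transactions are issued.
    __global__ void copy_strided(float *out, const float *in, int n, int Q) {
        int tid = (blockIdx.x * blockDim.x + threadIdx.x) * Q;
        if (tid < n) out[tid] = in[tid];
    }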

Memory coalescing
- Simplest case: addresses are contiguous across threads
- Accesses are organized by half warps (16 threads)
[Figure: a 4 x 4 tile M in row-major order; threads in a half warp step through the tile so that their accesses are contiguous in memory]
Nvidia Corp.

Memory coalescing protocol (compute capability 1.2)
1. Find the segment containing the address requested by the lowest-numbered active thread
2. Find all other active threads requesting addresses in the same segment
3. Reduce the transaction size, if possible
4. Mark the serviced threads as inactive
5. Repeat until all threads in the half warp are complete
[Figure: one transaction covering a 64-byte segment vs. two transactions covering 64-byte and 32-byte segments]
Nvidia CUDA C Programming Guide, Appendix G.3.2.2

Coalescing with 2-D arrays (tx = threadIdx.x, ty = threadIdx.y)
- All warps in a block access consecutive elements within a row as they step through neighboring columns:

    I = blockIdx.y*by + ty;  J = blockIdx.x*bx + tx;
    a[ty][tx] = A[I*N + k*by + tx];
    b[ty][tx] = B[J + N*(k*bx + ty)];

- Accesses by threads in a block along a column don't coalesce:

    I = blockIdx.x*bx + tx;  J = blockIdx.y*by + ty;
    a[tx][ty] = A[I*N + k*by + ty];
    b[ty][tx] = B[J + N*(k*bx + tx)];

[Figure: thread indices 0-15 sweeping across rows (coalesced) vs. down columns (uncoalesced)]
David Kirk/NVIDIA & Wen-mei Hwu/UIUC

Coalesced accesses improve performance

Fast (coalesced) version, with the slow (uncoalesced) alternatives shown as comments:

    I = blockIdx.y*by + ty;  J = blockIdx.x*bx + tx;   // Slow: I = blockIdx.x*bx + tx;  J = blockIdx.y*by + ty;
    __shared__ float a[blk][blk], b[blk][blk];
    if ((I < N) && (J < N)) {
        float c = 0.0f;
        for (k = 0; k < gy; k++) {
            a[ty][tx] = A[I*N + k*by + tx];            // Slow: a[tx][ty] = A[I*N + k*by + ty];
            b[ty][tx] = B[J + N*(k*bx + ty)];          // Slow: b[ty][tx] = B[J + N*(k*bx + tx)];
            __syncthreads();
            for (kk = 0; kk < bx; kk++)
                c += a[ty][kk] * b[kk][tx];
            __syncthreads();
        }
        C[I*N + J] = c;
    }

Today's lecture
- Overlap
- Occupancy
- Improving matrix multiplication performance with shared memory
- Memory coalescing
- Avoiding bank conflicts

Shared memory banks
- A load or store of n addresses spanning n distinct memory banks can be serviced simultaneously: effective bandwidth is n times that of a single bank
- If multiple addresses map to the same memory bank, the accesses are serialized
  - Hardware splits the request into as many separate conflict-free requests as necessary
  - Exception: if all threads access the same address, the word is broadcast
- Devices of compute capability 2.x can additionally multicast shared memory accesses
- See the CUDA C Best Practices Guide

Shared memory bank access
- A load or store of n addresses spanning n distinct memory banks can be serviced simultaneously: effective bandwidth is n times that of a single bank
- Each bank can service 1 address per cycle (a broadcast also takes one cycle)
- Access to shared memory is fast unless 2 or more threads in a half warp access different addresses in the same bank: then there is a conflict and the accesses serialize
  - Exception: if all threads access the same address, it is broadcast
[Figure: thread-to-bank access patterns with and without conflicts]

    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = a[idx] + 1.0f;

David Kirk/NVIDIA & Wen-mei Hwu/UIUC

Identifying bank conflicts
- Traditional wisdom for exploiting cache locality can result in bank conflicts
- What if a thread loads 2 consecutive array elements?

    int tid = threadIdx.x;
    shared[2*tid]   = global[2*tid];
    shared[2*tid+1] = global[2*tid+1];

- To avoid conflicts:

    shared[tid]              = global[tid];
    shared[tid + blockDim.x] = global[tid + blockDim.x];

- Consider:

    __shared__ float shared[256];
    float foo = shared[base + s * threadIdx.x];

  If s has no common factors with the number of banks (16), then there are no conflicts (i.e., s is odd)
[Figure: a memory system with 4 banks; threads mapping onto banks]
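When a stride-of-banks pattern cannot be avoided, a common remedy is to pad the shared array; a hedged sketch (the transpose kernel and tile size are illustrative, not from the slides; 16 banks are assumed, as on compute capability 1.x):

    #define TILE 16

    // Padding the second dimension by one word shifts each row to a different bank,
    // so the column-wise reads below no longer all hit the same bank.
    __global__ void transpose_tile(float *out, const float *in, int N) {
        __shared__ float tile[TILE][TILE + 1];   // +1 column of padding avoids bank conflicts

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < N && y < N)
            tile[threadIdx.y][threadIdx.x] = in[y * N + x];   // coalesced global load
        __syncthreads();

        // Write the transposed tile; the shared-memory reads walk down a column.
        int tx = blockIdx.y * TILE + threadIdx.x;
        int ty = blockIdx.x * TILE + threadIdx.y;
        if (tx < N && ty < N)
            out[ty * N + tx] = tile[threadIdx.x][threadIdx.y];
    }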

Shared memory design
- Successive 32-bit words are assigned to successive banks
- Devices of compute capability 2.x [Fermi]:
  - Number of banks = 32
  - Bandwidth is 32 bits per bank per 2 clock cycles
  - A shared memory request for a warp is not split
    - Increased susceptibility to conflicts
    - But no conflicts if threads access bytes within the same 32-bit word
    - Unlike 1.x, no bank conflicts in the given code example
- Devices of compute capability 1.x [Lilliput]:
  - Number of banks = 16
  - Bandwidth is 32 bits per bank per clock cycle
  - A shared memory request for a warp is split in two
  - No conflict occurs if only one memory location per bank is accessed by a half warp of threads

Coalesced access and no bank conflicts

    I = blockIdx.y*by + ty;  J = blockIdx.x*bx + tx;   // Slow: I = blockIdx.x*bx + tx;  J = blockIdx.y*by + ty;
    __shared__ float a[blk][blk], b[blk][blk];
    if ((I < N) && (J < N)) {
        float c = 0.0f;
        for (k = 0; k < gy; k++) {
            a[ty][tx] = A[I*N + k*by + tx];            // Slow: a[tx][ty] = A[I*N + k*by + ty];
            b[ty][tx] = B[J + N*(k*bx + ty)];          // Slow: b[ty][tx] = B[J + N*(k*bx + tx)];
            __syncthreads();
            for (kk = 0; kk < bx; kk++)
                c += a[ty][kk] * b[kk][tx];            // a[ty][kk]: all threads in the half warp read the same address: broadcast
            __syncthreads();
        }
        C[I*N + J] = c;
    }