Lecture 7. Using Shared Memory: Performance Programming and the Memory Hierarchy
Scott B. Baden, CSE 260, Winter 2014



Assignment #1. Blocking for cache will boost performance, but a lot more is needed to approach ATLAS performance (8.14 GFlops). Peak rate R = 4 x 2.33 GHz = 9.32 GFlops, so ATLAS reaches ~87% of peak.

Today's lecture:
- Occupancy
- Improving matrix multiplication performance with shared memory
- Memory coalescing
- Avoiding bank conflicts

Occupancy
A minimum number of warps is needed to hide memory latency.
Occupancy = # active warps / maximum # warps supported by the vector unit (SM). It is limited by the vector unit's resources:
- amount of shared memory
- number of registers
- maximum number of threads
Example: consider a kernel with a 16 x 16 thread block, 2648 bytes of shared memory per block, and 38 registers per thread (38 * 256 = 9728 < 16K registers). The number of available registers is the limiting factor: two such blocks would need 19,456 registers, more than 16K, so at most one block is resident.
Tradeoff: more blocks with fewer threads, or more threads with fewer blocks.
Locality: we want small blocks of data (and hence more plentiful warps) that fit into fast memory; register consumption matters.
Maximizing occupancy may not maximize performance.

Occupancy Calculator: http://developer.download.nvidia.com/compute/cuda/cuda_occupancy_calculator.xls

Determining occupancy
Recall the definition of occupancy: # active warps / maximum # warps supported by the vector unit.
NVIDIA provides an occupancy calculator. Determine resource usage from nvcc for the provided A2 code on Dirac (compute capability 2.0):

    nvcc --ptxas-options=-v
    Used 25 registers, 64 bytes cmem[0]

Occupancy calculation with 16 x 16 threads

Physical limits for GPU compute capability 2.0:
- Threads per warp: 32
- Warps per multiprocessor: 48
- Threads per multiprocessor: 1536
- Thread blocks per multiprocessor: 8
- Total 32-bit registers per multiprocessor: 32768
- Register allocation unit size: 64; register allocation granularity: warp
- Maximum registers per thread: 63
- Shared memory per multiprocessor: 49152 bytes; shared memory allocation unit size: 128
- Warp allocation granularity: 2
- Maximum thread block size: 1024

Resource usage entered into the calculator: compute capability 2.0, shared memory size config 49152 bytes, 256 threads per block, 25 registers per thread, 0 bytes of shared memory per block.

Allocated resources (per block / limit per SM / allocatable blocks per SM):
- Warps (threads per block / threads per warp): 8 / 48 / 6
- Registers (warp limit per SM due to per-warp register count): 8 / 38 / 4
- Shared memory (bytes): 0 / 49152 / 8

Maximum thread blocks per multiprocessor (blocks/SM x warps/block = warps/SM):
- Limited by max warps or max blocks per multiprocessor: 6 blocks
- Limited by registers per multiprocessor: 4 blocks x 8 warps/block = 32 warps (the occupancy limiter)
- Limited by shared memory per multiprocessor: 8 blocks

GPU occupancy data: active threads per multiprocessor 1024, active warps per multiprocessor 32, active thread blocks per multiprocessor 4.
Occupancy = 32 active warps / 48 maximum warps per SM = 67%.
(SM is an abbreviation for (Streaming) Multiprocessor.)
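The spreadsheet is not the only way to obtain these numbers. As a minimal sketch (not from the lecture, and requiring CUDA 6.5 or later), the runtime's occupancy API reports the same active-block count for a compiled kernel. Here matmul is assumed to be the kernel defined on the "Naïve Kernel" slide, declared with double in place of its DOUBLE typedef, and launched with 16 x 16 = 256 threads per block:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void matmul(double *C, double *A, double *B);   // defined on a later slide

    void reportOccupancy() {
        const int threadsPerBlock = 256;        // 16 x 16 block, as in the example above
        int blocksPerSM = 0;
        // How many blocks of this kernel fit on one SM, given its register and
        // shared-memory usage at this block size?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, matmul,
                                                      threadsPerBlock, 0);
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        int activeWarps = blocksPerSM * threadsPerBlock / prop.warpSize;
        int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
        printf("occupancy: %d / %d warps = %.0f%%\n",
               activeWarps, maxWarps, 100.0 * activeWarps / maxWarps);
    }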

Full occupancy
[Occupancy calculator charts: "Impact of Varying Block Size" and "Impact of Varying Register Count Per Thread", plotting multiprocessor warp occupancy (0-64 warps) against threads per block (marker at block size 64) and registers per thread (marker at 25 registers).]

Summary - programming issues
- Branches serialize execution within a warp.
- Tradeoff: more blocks with fewer threads, or more threads with fewer blocks.
- Locality: want small blocks of data (and hence more plentiful warps) that fit into fast memory; watch register consumption.
- Scheduling: hide latency.
- Shared memory and registers do not persist across kernel invocations.
Next: using shared memory.

Today's lecture:
- Occupancy
- Improving matrix multiplication performance with shared memory
- Memory coalescing
- Avoiding bank conflicts

Speeding up matrix multiply
- Use shared memory to increase re-use
- Avoid thread divergence
- Memory coalescing; avoid bank conflicts

Shared memory / L1 cache
On-chip local store, part shared memory and part L1 cache: either 16 KB shared memory + 48 KB L1 cache, or 48 KB shared memory + 16 KB L1 cache; one per vector unit (SM).
All threads in a block share this on-chip memory; a collection of warps shares a portion of the local store.
The L1 caches accesses to local or global memory, including temporary register spills.
The L2 cache is shared by all vector units. Cache inclusion (L1 in L2) is partially configurable on a per-access basis with memory-reference instruction modifiers. The cache line size is 128 bytes.
Set the mode using cudaFuncSetCacheConfig(boundariesX, preference), where preference is cudaFuncCachePreferShared or cudaFuncCachePreferL1.
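A small usage sketch (an assumption-laden fragment, not from the slides): boundariesX is the kernel named on this slide, and grid, threads, and the kernel arguments are assumed to be defined elsewhere. The preference is set once per kernel, before launching it:

    // Request the 48 KB shared / 16 KB L1 split for this kernel.
    cudaError_t err = cudaFuncSetCacheConfig(boundariesX, cudaFuncCachePreferShared);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaFuncSetCacheConfig: %s\n", cudaGetErrorString(err));
    boundariesX<<<grid, threads>>>(/* kernel arguments */);   // runs with the requested split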

Naïve kernel implementation
Each thread computes one element of C: it loads a row of matrix A, loads a column of matrix B, and computes a dot product.
Every value of A and B is loaded N times from global memory.
[Figure: a thread block and thread (2,2) computing one element of C from a row of A and a column of B, with tiles of width BLOCK_SIZE; courtesy David Kirk/NVIDIA and Wen-mei Hwu/UIUC.]

Naïve kernel

    __global__ void matmul(DOUBLE *C, DOUBLE *A, DOUBLE *B) {
        int I = blockIdx.x*blockDim.x + threadIdx.x;
        int J = blockIdx.y*blockDim.y + threadIdx.y;
        int N = blockDim.y*gridDim.y;          // assume a square matrix
        if ((I < N) && (J < N)) {
            DOUBLE _c = 0;
            for (unsigned int k = 0; k < N; k++) {
                DOUBLE a = A[I*N + k];         // element of A's row I
                DOUBLE b = B[k*N + J];         // element of B's column J
                _c += a * b;
            }
            C[I*N + J] = _c;
        }
    }

For reference, the equivalent serial loop nest:

    for (unsigned int i = 0; i < N; i++)
        for (unsigned int j = 0; j < N; j++) {
            DOUBLE sum = 0;
            for (unsigned int k = 0; k < N; k++)
                sum += A[i*N + k] * B[k*N + j];
            C[i*N + j] = (DOUBLE) sum;
        }
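A minimal host-side launch sketch for this kernel (not from the slides; d_A, d_B, and d_C are assumed device allocations of N*N values, and N is assumed to be a multiple of 16 so every thread passes the bounds test):

    dim3 threads(16, 16);             // 256 threads per block, as in the occupancy example
    dim3 grid(N/16, N/16);            // one thread per element of C
    matmul<<<grid, threads>>>(d_C, d_A, d_B);
    cudaDeviceSynchronize();          // wait for the kernel before using d_C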

Recall blocked matrix multiplication
N x N blocks, n x n global matrix, block size b = n/N.

    for i = 0 to N-1
        for j = 0 to N-1
            // load block C[i,j] into cache, once: n^2 words in total
            for k = 0 to N-1
                // load blocks A[i,k] and B[k,j]: N^3 block loads of (n/N)^2 words each = 2N n^2 words
                C[i,j] += A[i,k] * B[k,j]    // do the block matrix multiply
            // write block C[i,j] once: n^2 words in total

Total memory traffic: (2N + 2) n^2 words.

[Figure: C[i,j] = C[i,j] + A[i,k] * B[k,j], accumulating over k.]
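As a quick sanity check (a back-of-the-envelope estimate, not from the slide): with n = 512 and b = 16, N = n/b = 32, so the blocked algorithm moves (2*32 + 2) * 512^2, roughly 17 million words, whereas the naïve kernel above, which re-reads a full row and column per output element, moves about 2n^3, roughly 268 million words.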

Improving locality in matrix multiply
Naïve algorithm: each thread independently loads all the data it needs, a full row and column of the input; each input element is loaded multiple times; each thread performs 1 multiply-add for every 2 loads, plus 1 store.
Blocked algorithm with shared memory: threads cooperate to load a block of A and B into on-chip shared memory; each thread in the block performs the ijk loop within shared memory; each thread performs b multiply-adds for every 1 load, plus 1 store.
[Figure: thread block and thread (2,2) with BLOCK_SIZE tiles of A and B staged in shared memory.]

Using shared memory (uncoalesced global loads)

    __global__ void matmul(float *C, float *A, float *B, int N) {
        const unsigned int bx = BLOCK_X, by = BLOCK_Y;
        const unsigned int tx = threadIdx.x, ty = threadIdx.y;
        const unsigned int I = blockIdx.x*bx + tx, J = blockIdx.y*by + ty;
        const unsigned int gx = gridDim.x, gy = gridDim.y;
        __shared__ float a[BLOCK_X][BLOCK_Y], b[BLOCK_X][BLOCK_Y];
        if ((I < N) && (J < N)) {
            float c = 0.0f;
            for (unsigned int k = 0; k < gy; k++) {
                a[tx][ty] = A[I*N + k*by + ty];
                b[ty][tx] = B[J + N*(k*bx + tx)];
                __syncthreads();      // all threads in the block have loaded their elements
                for (unsigned int kk = 0; kk < bx; kk++)
                    c += a[kk][tx] * b[kk][ty];
                __syncthreads();      // avoid memory hazards before overwriting the tiles
            }
            C[I*N + J] = c;
        }
    }
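A hedged usage sketch (not from the slides): the tile sizes are compile-time constants, and with 16 x 16 float tiles the kernel's static shared-memory footprint is 2 * 16 * 16 * 4 = 2048 bytes per block, well under the per-SM limits on the occupancy slide. d_A, d_B, and d_C are assumed device pointers and N is assumed to be a multiple of the tile size, so every thread reaches the __syncthreads() calls:

    #define BLOCK_X 16
    #define BLOCK_Y 16
    dim3 threads(BLOCK_X, BLOCK_Y);
    dim3 grid(N/BLOCK_X, N/BLOCK_Y);
    matmul<<<grid, threads>>>(d_C, d_A, d_B, N);   // each block stages one pair of tiles at a time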

Results with shared memory: N = 512, double precision, on Dirac (C2050)

    Geometry       16 x 16   8 x 8   4 x 4
    Uncoalesced      40        39      14
    Coalesced        93        66      14

Global memory variant on Dirac:

    Geometry      Prefer shared memory   Prefer L1$
    1 x 512               57                 64
    2 x 256               50                 58
    4 x 128               33                 42

Today's lecture:
- Occupancy
- Improving matrix multiplication performance with shared memory
- Memory coalescing
- Avoiding bank conflicts

Memory interleaving
Interleaving across banks compensates for slow memory access times. Assume we access memory consecutively: what happens if the stride equals the number of banks? (Then every access falls in the same bank and the accesses serialize.)
[Figure: addresses 0-19 distributed round-robin across interleaved banks; Ruye Wang, fourier.eng.hmc.edu/e85/lectures/memory.]

Global memory coalescing
Global memory is accessed in units of 32, 64, or 128 bytes. Coalescing does not require sequential accesses by threads; the accesses only need to fall within the same 128-byte segment.
Accesses are organized by warp: the concurrent reads of a warp coalesce into K transactions, where K is the number of distinct 128-byte cache lines covered by the accesses. Accesses not cached in L1 are serviced 32 bytes at a time (4 per line) in L2.

    shrd[threadIdx.x] = gbl[blockIdx.x*blockDim.x + threadIdx.x];   // coalesces: YES
    shrd[threadIdx.x] = gbl[blockDim.x + blockIdx.x*threadIdx.x];   // coalesces: NO

[Figure: 32-, 64-, and 128-byte segments; NVIDIA CUDA Best Practices, Sec. 9.2.1; NVIDIA Corp.]

Global memory
If the accessed word is larger than 4 bytes, a warp's memory request is split into separate, independently issued 128-byte memory requests.
For non-atomic concurrent writes to the same location within a warp, which thread's write wins is undefined.
cudaMalloc() is guaranteed to return memory aligned to at least 256 bytes.

Memory coalescing
Simplest case: the addresses fall within the same 128-byte segment. Accesses are organized by half warps (16 threads).
[Figure: a 4 x 4 matrix of elements M(0,0) .. M(3,3) shown both in tile-iteration order and in its row-major memory layout, with threads mapped onto consecutive elements; NVIDIA Corp.]

Memory coalescing (compute capability 1.2)
- Find the segment containing the address requested by the lowest-numbered active thread.
- Find all other active threads whose requests fall in the same segment.
- Reduce the transaction size, if possible.
- Mark the serviced threads as inactive.
- Repeat until all threads in the warp are complete.
The protocol handles permuted access patterns, too.
[Figure: one access pattern served by 1 transaction on a single 128-byte segment, and another needing 2 transactions on two 128-byte segments; NVIDIA CUDA C Programming Guide, Appendix G.4.2.]

Coalescing with 2D arrays
All warps in a block access consecutive elements within a row as they step through neighboring columns (tx = threadIdx.x, ty = threadIdx.y):

    I = blockIdx.y*by + ty;  J = blockIdx.x*bx + tx;
    a[ty][tx] = A[I*N + k*by + tx];
    b[ty][tx] = B[J + N*(k*bx + ty)];

Accesses by threads in a block along a column don't coalesce:

    I = blockIdx.x*bx + tx;  J = blockIdx.y*by + ty;
    a[tx][ty] = A[I*N + k*by + ty];
    b[ty][tx] = B[J + N*(k*bx + tx)];

David Kirk/NVIDIA & Wen-mei Hwu/UIUC

Coalesced accesses improve performance

    // Fast (coalesced):               // Slow (uncoalesced):
    I = blockIdx.y*by + ty;            // I = blockIdx.x*bx + tx;
    J = blockIdx.x*bx + tx;            // J = blockIdx.y*by + ty;

    __shared__ float a[BLK][BLK], b[BLK][BLK];
    if ((I < N) && (J < N)) {
        float c = 0.0f;
        for (k = 0; k < gy; k++) {
            a[ty][tx] = A[I*N + k*by + tx];     // slow: a[tx][ty] = A[I*N + k*by + ty];
            b[ty][tx] = B[J + N*(k*bx + ty)];   // slow: b[ty][tx] = B[J + N*(k*bx + tx)];
            __syncthreads();
            for (kk = 0; kk < bx; kk++)
                c += a[ty][kk] * b[kk][tx];
            __syncthreads();
        }
        C[I*N + J] = c;
    }

Fin