
April 4-7, 2016, Silicon Valley

GPU MEMORY BOOTCAMP III: COLLABORATIVE ACCESS PATTERNS
Tony Scudiero, NVIDIA DevTech

Fanatical Bandwidth Evangelist

The Axioms of Modern Performance
#1. Parallelism is mandatory. Chips have been getting wider, not faster, for 10+ years.
#2. In real machines, as parallelism increases, real applications become limited by the rate at which operands can be supplied to functional units. Flop/bandwidth ratios increase; memory latency increases; single-chip NUMA appears.

Bootcamp 1: Best Practices
http://on-demand.gputechconf.com/gtc/2015/video/s5353.html
- Loads in flight
- Coalescing
- Shared memory and bank conflicts
- Memory-level parallelism
A stream of C-5 Galaxy aircraft filled with microSD cards has amazing bandwidth, but the latency leaves something to be desired.

Bootcamp 2: Beyond Best Practices
http://on-demand.gputechconf.com/gtc/2015/video/s5376.html
- Opt-in L1 caching: when it helps, when it hurts
- Maximize data load size when applicable
- Exploit memory-level parallelism: hoist loads for in-thread latency hiding
- Maximize loads in flight per thread
A minimal sketch of the last three points follows.
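To make the recap concrete, here is a minimal sketch of "maximize data load size" and "hoist loads" in one kernel. This is not code from the talk; the names (in, out, n4) are illustrative, and it assumes the input length is a multiple of four floats so the float4 view is valid.

__global__ void hoisted_float4_sum(const float4 *in, float *out, long n4)
{
    long tid = blockIdx.x * (long)blockDim.x + threadIdx.x;
    if (tid < n4) {
        // One 16-byte load instead of four 4-byte loads: fewer, wider
        // transactions, and all four operands are hoisted into registers
        // before any arithmetic uses them, keeping the load in flight.
        float4 v = in[tid];
        out[tid] = (v.x + v.y) + (v.z + v.w);
    }
}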

Bootcamp 2: The Lost Slide

Bootcamp 2: The Lost Kernel

template<typename T, int MAX_SUMMANDS, int OPS_PER_PASS>
__global__ void collaborative_access_small_summand_versatile(
    long n, long m, long *offsetarray, T *dataarray,
    int *iterationarray, int summands, T *results)
{
    __shared__ long iterationoffset[COLLABORATIVE_CTA_SIZE_MAX];
    __shared__ T    sdata[OPS_PER_PASS][MAX_SUMMANDS + 1];  // +1 pads away bank conflicts
    __shared__ T    sresult[COLLABORATIVE_CTA_SIZE_MAX];
    __shared__ int  sitercount[COLLABORATIVE_CTA_SIZE_MAX];
    __shared__ int  maxit;
    // (declarations only; the slide shows just the top of the kernel)

I Collect Quadro M6000s
All results are from a Quadro M6000 (GM200).

AGENDA
- Review of Bootcamp 1 and Bootcamp 2
- Defining Collaborative Patterns
- Reintroduce the Bootcamp Kernel
- Two Collaborative Versions of the Bootcamp Kernel
- Example: Signal Processing

Defining Collaborative Patterns
What is collaborative? Collaborate: to work together with others to achieve a common goal.
A non-exhaustive list of collaborative-pattern indicators:
- Data dependency between threads, especially on operation results
- The number of outputs is less than the number of active threads
- A thread works on data it did not load
- __syncthreads()

Collaborative Patterns: Memory Specific
Fundamental principle: minimize repeated or wasted memory traffic.
- Achieve coalescing
- Reuse data loaded by other threads
Focus on the load/store structure of the algorithm, as opposed to its inputs/outputs or compute structure. This may result in threads that never issue math ops.

1D Stencil is Collaborative
Reminder: in a 1D stencil, each output element is a function of the input elements within `radius` on either side, e.g. output[x] = f(input[x-3 : x+3]) for radius 3.
[Figure: input array with a radius-3 window mapping to one output element]

Reuse in 1D Stencil
Each input element is read by up to 2*radius + 1 neighboring outputs. Rather than having every thread re-load those elements from global memory, they should be loaded once and shared: the stencil should be collaborative. A minimal sketch follows.
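A minimal collaborative 1D stencil in the spirit of this slide, using a simple sum as a stand-in for f. This is not the talk's code: it assumes the input array is padded with RADIUS ghost cells at each end so the halo loads stay in bounds, and all names are illustrative.

#define RADIUS 3
#define BLOCK_SIZE 256

__global__ void stencil_1d(const float *in, float *out)
{
    __shared__ float tile[BLOCK_SIZE + 2 * RADIUS];
    int gindex = blockIdx.x * blockDim.x + threadIdx.x;
    int lindex = threadIdx.x + RADIUS;

    // Each thread loads one interior element (skipping the left ghost cells)...
    tile[lindex] = in[gindex + RADIUS];
    if (threadIdx.x < RADIUS) {
        // ...and the first RADIUS threads also fetch the two halos.
        tile[lindex - RADIUS] = in[gindex];
        tile[lindex + BLOCK_SIZE] = in[gindex + RADIUS + BLOCK_SIZE];
    }
    __syncthreads();

    // Every thread now reads 2*RADIUS elements that neighbors loaded:
    // 2*RADIUS + 1 reads per output, but only ~1 global load per thread.
    float result = 0.0f;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += tile[lindex + offset];
    out[gindex] = result;
}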

AGENDA
- Review of Bootcamp 1 and Bootcamp 2
- Defining Collaborative Patterns
- Reintroduce the Bootcamp Kernel
- Two Collaborative Versions of the Bootcamp Kernel
- Example: Signal Processing

THE BOOTCAMP CODE
[Figure: the data array in memory, and the elements accessed by a single task]

Type Abstraction
- Parallelism: M executions (tasks)
- Random starting offset into dataarray
- Outer loop over iterations; inner loop over summands
- A new dataarray offset each iteration
A hedged reconstruction of this reference kernel follows.
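The talk does not show the reference kernel itself, so here is a reconstruction of its shape from the description above. Names follow the slides where they appear (offsetarray, dataarray, summands, results); everything else, including the exact signature, is a guess.

__global__ void reference_kernel(long m, const long *offsetarray,
                                 const float *dataarray,
                                 int iterations, int summands, float *results)
{
    long idx = blockIdx.x * (long)blockDim.x + threadIdx.x;  // task id, m tasks total
    if (idx >= m) return;
    long offset = offsetarray[idx];              // random starting offset
    for (int i = 0; i < iterations; i++) {
        // Inner loop: `summands` consecutive words, like reading one AoS
        // struct. Consecutive *within* a thread, so the loads do not
        // coalesce *across* threads.
        float r = 0.0f;
        for (int s = 0; s < summands; s++)
            r += dataarray[offset + s];
        results[idx] += r;
        offset = offsetarray[offset];            // new random offset each iteration
    }
}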

Baseline Performance

Kernel               | Float perf (M iters/s) | Float BW (GB/s) | Double perf (M iters/s) | Double BW (GB/s)
Reference            | 145.79                 | 18.67           | 91.71                   | 23.50
Bootcamp2: LoadAs_u2 | 405.5                  | 51.95           | 158.33                  | 40.577

AGENDA
- Review of Bootcamp 1 and Bootcamp 2
- Defining Collaborative Patterns
- Reintroduce the Bootcamp Kernel
- Two Collaborative Versions of the Bootcamp Kernel
- Example: Signal Processing

Collaborative Approach
- Each task's starting location is random, but...
- Subsequent loads (after the first) are not random: each iteration reads a chunk of sequential, stride-1 memory
- This simulates reading a multi-word data structure in AoS fashion
- Could we load that sequential data (struct) in a coalesced fashion?
- Assume we cannot parallelize across iterations

Collaborative Approaches
1. Each thread block performs one task: a vector-style pattern.
2. Each thread does one task, but the thread block collaborates on loads: a collaborative memory pattern only; the compute structure remains similar.

Collaborative Approach #1: Vector-Style Collaboration
Restructure for coalescing: each block, instead of each thread, performs the work of a single task.

Collaborative Approach #1: Vector-Style Kernel
Assumes summands == blockDim.x; condition checks omitted from the slide. blockIdx.x == task id. A single thread performs the compute!

__shared__ float sbuffer[MAX_SUMMANDS];
int offset = offsetarray[blockIdx.x];
for (int i = 0; i < iterations; i++) {
    // The whole block loads this iteration's summands, coalesced.
    sbuffer[threadIdx.x] = dataarray[offset + threadIdx.x];
    __syncthreads();
    if (threadIdx.x == 0) {
        float r = 0.0f;
        for (int s = 0; s < summands; s++)
            r += sbuffer[s];
        results[blockIdx.x] += r;
    }
    offset = offsetarray[offset];   // chase to the next iteration's offset
    __syncthreads();                // don't overwrite sbuffer while thread 0 reads it
}

Performance Report

Kernel               | Float perf (M iters/s) | Float BW (GB/s) | Double perf (M iters/s) | Double BW (GB/s)
Reference            | 145.79                 | 18.67           | 91.71                   | 23.50
Bootcamp2: LoadAs_u2 | 405.5                  | 51.95           | 158.33                  | 40.577
Vector-Style         | 364.3                  | 46.7            | 90.1                    | 23.1

Sneaky Stuff
The code on the previous slide works as expected for:
- single precision on all hardware
- double precision on full-throughput double-precision hardware, which GM200 is not
On GM200, the single-thread reduction becomes the compute limiter due to double-precision latency. Using a parallel reduction (sketched below) returns performance to our expectations.
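The talk doesn't show the reduction variant. A minimal sketch, assuming blockDim.x is a power of two: replace the single-thread `if (threadIdx.x == 0)` sum in the vector-style kernel with a standard shared-memory tree reduction (in the style of Mark Harris's "Optimizing Parallel Reduction in CUDA").

// Replaces the single-thread sum. sbuffer[] already holds blockDim.x
// loaded summands; the reduction overwrites it, which is fine because
// it is reloaded at the top of the next iteration.
for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride)
        sbuffer[threadIdx.x] += sbuffer[threadIdx.x + stride];
    __syncthreads();
}
if (threadIdx.x == 0)
    results[blockIdx.x] += sbuffer[0];   // one accumulate per iteration, as before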

Performance Report

Kernel                | Float perf (M iters/s) | Float BW (GB/s) | Double perf (M iters/s) | Double BW (GB/s)
Reference             | 145.79                 | 18.67           | 91.71                   | 23.50
Bootcamp2: LoadAs_u2  | 405.5                  | 51.95           | 158.33                  | 40.577
Vector-Style          | 364.3                  | 46.7            | 90.1                    | 23.1
Vector-Style w/reduce | 338.1                  | 43.32           | 266.84                  | 68.37

A Second Version: A Hybrid Kernel
- A single thread per task (like the original kernel)
- The whole block performs the loads (like the vector kernel)
- Collaborative memory structure: threads join up to collaborate on loads, then split up for compute
- Threads can diverge in compute, but must be kept alive

Collaborative Access Kernel, in Pseudocode

for i = 1:iterations
    // Load step
    for each thread in the block
        coalesced load of this iteration's data into the shared buffer
    end
    __syncthreads()
    // Compute step
    for s = 1:summands
        r += shared_buffer[threadIdx.x][s]
    end
    update offsets for the next iteration
    __syncthreads()
end

COLLABORATIVE LOAD
Each color is the data for one iteration of one task.
[Figure, animated over three slides: the shared memory buffer filling, one coalesced sweep at a time, from the data array in memory]


Compute Step
After the load step, each thread processes its own data from the shared memory buffer: thread 0 produces result[0], thread 1 produces result[1], ..., up to thread B and result[B], where B = blockDim.x.
[Figure: each row of the shared memory buffer feeding one thread]

Collaborative Access Kernel
Assumes blockDim.x == summands; no bounds checks on the slide.

__shared__ long  iterationoffset[THREADS_PER_BLOCK];
__shared__ float sdata[THREADS_PER_BLOCK][MAX_SUMMANDS + 1];  // +1 pads away bank conflicts

long idx = blockDim.x * blockIdx.x + threadIdx.x;
long offset = offsetarray[idx];
for (int i = 0; i < iterations; i++) {
    // Load step: publish every task's offset, then the whole block loads
    // each task's summands in coalesced sweeps.
    iterationoffset[threadIdx.x] = offset;
    __syncthreads();
    for (int s = 0; s < THREADS_PER_BLOCK; s++)
        sdata[s][threadIdx.x] = dataarray[iterationoffset[s] + threadIdx.x];
    __syncthreads();

    // Compute step: each thread sums its own task's data.
    float r = 0.0f;
    for (int s = 0; s < summands; s++)
        r += sdata[threadIdx.x][s];
    result[idx] += r;
    offset = offsetarray[offset];
    __syncthreads();
}

What It Really Looks Like
- Even at 32 threads per block (1 warp), sdata is 4 KB (float) or 8 KB (double): max occupancy 15.6% (float) or 7.8% (double).
- Solution: multiple passes per thread block. The shared buffer becomes 32 x K; for K = 8, sdata is 1024 B (float) or 2048 B (double) per block.
- All threads load, but only threads 0-7 perform computation, so only 25% of threads do compute.
- K = 8 and K = 16 are the best performers on GM200.
A sketch of the implied pass structure follows.
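The Lost Kernel slide shows only declarations, so here is my reconstruction of the K-pass loop they imply (OPS_PER_PASS plays the role of K), for one outer iteration and with all bounds checks omitted. Treat this as a sketch, not the talk's actual code.

for (int pass = 0; pass < blockDim.x / OPS_PER_PASS; pass++) {
    int task = pass * OPS_PER_PASS;              // first task in this pass
    // All threads cooperate on the loads: K tasks' summands, coalesced.
    for (int k = 0; k < OPS_PER_PASS; k++)
        sdata[k][threadIdx.x] = dataarray[iterationoffset[task + k] + threadIdx.x];
    __syncthreads();
    // Only the first K threads compute; each one sums a single task.
    if (threadIdx.x < OPS_PER_PASS) {
        T r = T(0);
        for (int s = 0; s < summands; s++)
            r += sdata[threadIdx.x][s];
        sresult[task + threadIdx.x] += r;
    }
    __syncthreads();                             // sdata is reused next pass
}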

Performance Report

Kernel                | Float perf (M iters/s) | Float BW (GB/s) | Double perf (M iters/s) | Double BW (GB/s)
Reference             | 145.79                 | 18.67           | 91.71                   | 23.50
Bootcamp2: LoadAs_u2  | 405.5                  | 51.95           | 158.33                  | 40.577
Vector-Style          | 364.3                  | 46.7            | 90.1                    | 23.1
Vector-Style w/reduce | 338.1                  | 43.32           | 266.84                  | 68.37
Collaborative Load    | 736.0                  | 94.3            | 452.76                  | 116.02

Discussion Slide
32 threads per thread block, single precision:

Kernel               | Threads performing addition | Tasks per block | Perf GM200 (M iters/s) | BW GM200 (GB/s)
Reference            | blockDim.x                  | blockDim.x      | 145.79                 | 18.67
LoadAs               | blockDim.x                  | blockDim.x      | 405.5                  | 51.95
Vector-Style         | 1                           | 1               | 364.3                  | 46.7
Collaborative Access | K (of blockDim.x)           | blockDim.x      | 736.0                  | 94.3

Comparing the Kernels
Why is the collaborative access version better? Loads in flight per thread block.
[Figure: loads in flight per thread block for each kernel]

But Wait, There's More
GM200 is Maxwell, which doesn't have full-throughput double precision. Shouldn't double precision be 1/32nd the speed of float?

Kernel                   | Perf (M iters/s) | vs. float | BW (GB/s) | vs. float
Vector w/reduce (float)  | 338.1            | -         | 43.32     | -
Vector w/reduce (double) | 266.84           | 78.9%     | 68.37     | 157.8%
Collaborative (float)    | 738.131          | -         | 94.57     | -
Collaborative (double)   | 452.9            | 61.4%     | 116.06    | 122.7%

One More Thing

Kernel               | Float perf (M iters/s) | Float BW (GB/s) | M iters/GB
Reference            | 145.79                 | 18.67           | 7.809
Bootcamp2: LoadAs_u2 | 405.5                  | 51.95           | 7.805
Vector-Style         | 364.3                  | 46.7            | 7.800
Collaborative        | 338.1                  | 43.32           | 7.805

Kernel                | Double perf (M iters/s) | Double BW (GB/s) | M iters/GB
Reference             | 91.71                   | 23.50            | 3.907
Bootcamp2: LoadAs_u2  | 158.33                  | 40.577           | 3.902
Vector-Style w/reduce | 266.84                  | 68.37            | 3.903
Collaborative         | 452.76                  | 116.02           | 3.902

AGENDA
- Review of Bootcamp 1 and Bootcamp 2
- Defining Collaborative Patterns
- Reintroduce the Bootcamp Kernel
- Two Collaborative Versions of the Bootcamp Kernel
- Example: Signal Processing

Signal Processing: Multi-Convolution
The algorithm: K (signal, filter) pairs.
- Signal length N: arbitrarily long
- Filter length M: very small

Collaborative Implementation
- One block per convolution, i.e., per (filter, signal) pair
- Each convolution is executed vector-like within the block
- The filter is collaboratively loaded into shared memory (cached)
- The signal is buffered in shared memory (cached)
- Strong resemblance to the 1D stencil!

Signal Processing: Bank Convolution
Slide-simplified kernel (no bounds checking). Collaborative loads are coalesced; operands and results stay in shared memory and registers; the write of results is coalesced.

__shared__ float filterbuffer[FILTERLENGTH];
__shared__ float signalbuffer[2 * BLOCK_SIZE];

int index = blockIdx.x;                          // which (signal, filter) pair
if (threadIdx.x < filterlength)                  // load the filter, reversed
    filterbuffer[threadIdx.x] = filters[index][filterlength - 1 - threadIdx.x];
signalbuffer[blockDim.x + threadIdx.x] = 0.0f;
for (int i = 0; i < signallength; i += blockDim.x) {
    __syncthreads();
    // Shift the previous chunk to the front of the buffer...
    signalbuffer[threadIdx.x] = signalbuffer[blockDim.x + threadIdx.x];
    // ...and load the next chunk of this pair's signal, coalesced.
    signalbuffer[blockDim.x + threadIdx.x] = signal[i + threadIdx.x];
    __syncthreads();
    float r = 0.0f;
    for (int t = 0; t < filterlength; t++)
        r += filterbuffer[t] * signalbuffer[t + threadIdx.x];
    results[index][i + threadIdx.x] = r;         // coalesced write
}
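For context, a hypothetical launch for the kernel above: one block per (signal, filter) pair, BLOCK_SIZE threads each. The kernel name, argument list, and data layout here are assumptions, not the talk's code.

dim3 grid(K);                    // K convolutions, one block per pair
dim3 block(BLOCK_SIZE);
bank_convolution<<<grid, block>>>(signals, filters, results,
                                  signallength, filterlength);
cudaDeviceSynchronize();         // wait for all K convolutions to finish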

AGENDA
- Review of Bootcamp 1 and Bootcamp 2
- Defining Collaborative Patterns
- Reintroduce the Bootcamp Kernel
- Two Collaborative Versions of the Bootcamp Kernel
- Example: Signal Processing

Specific Results for the Kernel Investigated
- At 1/32nd double-precision throughput, one thread per block doing compute is a limiter in this code.
- At 1/32nd throughput, parallel compute is not a limiter in this code.
- At full throughput, one thread per block doing compute is not a limiter in this code.
- The collaborative pattern worked well for the bootcamp kernel.

General Observations
- Collaborative patterns seem to use shared memory. Yes: they need in-block communication, and collaboration requires communication.
- Collaborative patterns are good when there is data reuse. Yes: data reuse is a natural candidate for a collaborative load structure.
- If code is limited by the rate at which operands are delivered to functional units: (a) optimization efforts should focus on operand delivery; (b) in collaborative kernels, some threads may not do any compute.

TROVE
https://github.com/bryancatanzaro/trove
A GPU library from Bryan Catanzaro (of deep learning fame): it provides access to AoS data structures, gathering large structures from random locations in GPU memory.
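A hedged sketch of what using Trove looks like, based on my reading of the project's README; treat the header path and API details as assumptions and check the repository above. A trove::coalesced_ptr wraps a plain pointer, and the warp cooperatively transposes accesses through it so an AoS gather coalesces.

#include <trove/ptr.h>

// A 28-byte AoS element; Trove targets structure sizes that are
// multiples of 4 bytes.
struct Record { float a, b, c, d, e, f, g; };

__global__ void gather(trove::coalesced_ptr<Record> src,
                       const int *indices, Record *dst, int n)
{
    // Trove's transposition works on converged warps, so this sketch
    // assumes n is a multiple of blockDim.x.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        dst[tid] = src[indices[tid]];   // random gather, coalesced by the warp
}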

Is This Code on GitHub?
By far the most frequently asked bootcamp question. The wait is over...
https://github.com/tscudiero/membootcamp

THANK YOU
April 4-7, 2016, Silicon Valley