CS377P Programming for Performance GPU Programming - II


CS377P Programming for Performance: GPU Programming - II. Sreepathi Pai, UTCS, November 11, 2015

Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work

Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work

Occupancy Recap: GPUs partition resources among running threads. The NVIDIA manual says to maximize occupancy. Why?

Reasoning about occupancy: kernel<<<x, y>>>(). Consider launching: 1 thread block; N thread blocks, with N equal to the number of SMs/SMX; N_Residency thread blocks; more than N_Residency thread blocks.
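
A minimal sketch (not from the slides) of asking the CUDA runtime how many blocks of a given kernel can be resident per SM, which is the N_Residency above; my_kernel and the block size of 256 are placeholders for illustration.

    #include <cstdio>

    __global__ void my_kernel() { }

    int main() {
        int blocks_per_sm = 0, num_sms = 0;
        cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, my_kernel, /*blockSize=*/256, /*dynamicSMemSize=*/0);
        // N_Residency in the slide's terms: resident blocks across the whole GPU
        printf("resident blocks per SM: %d, whole GPU: %d\n",
               blocks_per_sm, blocks_per_sm * num_sms);
        return 0;
    }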

Less Occupancy? Is there a case for reducing occupancy/residency, i.e. letting each thread consume more resources, or using smaller thread blocks?

Better Performance at Lower Occupancy Volkov, V., Better Performance at Lower Occupancy, GTC 2010

Volkov's Insights: Do more parallel work per thread to hide latency with fewer threads (i.e. increase ILP); unroll loops. Use more registers per thread so that slower shared memory is accessed less: shared memory latency is comparable to registers, but shared memory throughput is lower! Both may be accomplished by computing multiple outputs per thread (as sketched below). Note that Volkov underutilizes threads but maxes out registers! Fermi had 63 registers/thread; Kepler has 255 registers/thread. Why have a register limit?
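
One way to read "compute multiple outputs per thread": each thread below produces four results from four independent operations, so the scheduler can overlap them (ILP) instead of relying on more threads. This is a minimal sketch, not Volkov's code; it assumes the kernel is launched with at least ceil(n/4) threads in total.

    __global__ void scale4(const float *in, float *out, float alpha, int n) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
        if (i + 3 < n) {
            // four independent multiplies held in registers: no dependences between them
            float a = in[i]     * alpha;
            float b = in[i + 1] * alpha;
            float c = in[i + 2] * alpha;
            float d = in[i + 3] * alpha;
            out[i] = a; out[i + 1] = b; out[i + 2] = c; out[i + 3] = d;
        } else {
            for (int k = i; k < n; k++)   // tail: last few elements
                out[k] = in[k] * alpha;
        }
    }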

Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work

SIMT Issue: All threads in a warp execute the same instruction (same PC). What happens when that instruction is a conditional branch? When it is a load that misses for some threads but not others?

Divergence: If threads in a warp decide to execute different PCs, the warp splits. Two directions for a branch means two splits. Each split is executed serially. Nested branches also split correctly. The splits join back at a pre-determined meet point: the immediate post-dominator.

Example
if (cond) { x = 1; } else { y = 1; }
Assume warps contain four threads each, and that only T0 and T2 have cond == true:

Time   T0       T1       T2       T3
0      x = 1             x = 1
1               y = 1             y = 1

If cond is true for all threads:

Time   T0       T1       T2       T3
0      x = 1    x = 1    x = 1    x = 1

Tackling Divergence
Threads in the same warp should avoid divergent conditions. Easier said than done. Threads in the same warp should also try to access locations in the same memory line: memory divergence repeats requests until all threads have received their data. The compiler will predicate instructions: then there is no divergence, but both sides are executed. Predicated instructions are executed but do not commit; they are shown in [] below.

Time   T0        T1        T2        T3
0      x = 1     [x = 1]   x = 1     [x = 1]
1      [y = 1]   y = 1     [y = 1]   y = 1
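
A minimal sketch (not from the slides) contrasting a branch that diverges within every warp with one that is uniform per warp; the work is comparable, only the condition changes. warpSize is the CUDA built-in (32 on current hardware).

    __global__ void divergent(float *x, float *y) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid % 2 == 0)              // even/odd lanes disagree inside every warp,
            x[tid] = 1.0f;             // so both sides are serialized
        else
            y[tid] = 1.0f;
    }

    __global__ void uniform(float *x, float *y) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if ((tid / warpSize) % 2 == 0) // all 32 lanes of a warp take the same side,
            x[tid] = 1.0f;             // so the warp never splits
        else
            y[tid] = 1.0f;
    }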

Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work

The Cost of Everything: What is the ordering of operations, by cost, on a GPU? ALU: integer and FP (+, *, /, %). Special Function Unit: trig, log, etc. Atomics: to the same address, to different addresses. Loads/stores: global memory, shared memory, registers, caches (texture/constant/L1/L2). Barriers (__syncthreads()) and memory fences.

Throughputs

Modeling GPU Performance: Performance equation: Time = Operations / Throughput, where throughput is the rate at which operations complete. Example: load 144 MByte from memory; memory bandwidth is 144 GByte/s; Time = 144M / 144G = 1 ms.

Identifying Bottlenecks: A GPU program reads 144M bytes (144 GB/s), performs 144M atomic operations (1/clock, 745 MHz), and carries out 144M FMADDs (192/clock, 745 MHz). What is the most likely bottleneck? Reading: 144M / 144 GB/s = ? ms. Atomics: (144M / 1) / 745 MHz = ? ms. FMADDs: (144M / 192) / 745 MHz = ? ms.
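
Working the arithmetic out: reading takes 144M bytes / 144 GB/s = 1 ms; the atomics take 144M / 745 MHz ≈ 193 ms; the FMADDs take (144M / 192) / 745 MHz ≈ 1 ms. The atomics are the likely bottleneck, by roughly two orders of magnitude.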

Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work

The Scan Primitive: Fold reduces a list of values to a single value: ([1 2 0 1 3 5], +) gives 12. Scan reduces a list of values and also returns the intermediate values: ([1 2 0 1 3 5], +) gives the inclusive scan [1 3 3 4 7 12] and the exclusive scan [0 1 3 3 4 7]. Also known as: all-prefix-sums, prefix scan, tree reduction, etc.

Serial implementations of Scan

Exclusive scan:
    result[0] = 0;
    for (int i = 1; i < N; i++)
        result[i] = result[i - 1] + A[i - 1];

Inclusive scan:
    result[0] = A[0];
    for (int i = 1; i < N; i++)
        result[i] = result[i - 1] + A[i];

Parallel Implementation of Scan: Upsweep Sengupta et al. Scan Primitives for GPU Computing; Harris et al. GPU Gems 3

Parallel Implementation of Scan: Downsweep Sengupta et al. Scan Primitives for GPU Computing; Harris et al. GPU Gems 3
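
A sketch of the upsweep/downsweep structure in code, following the work-efficient exclusive scan described in GPU Gems 3 (an illustrative single-block version, not reproduced from the slides): the upsweep builds partial sums up a tree in shared memory, and the downsweep pushes them back down. Assumes n is a power of two, one thread per pair of elements, and a launch of the form block_exclusive_scan<<<1, n/2, n * sizeof(float)>>>(data, n).

    __global__ void block_exclusive_scan(float *data, int n) {
        extern __shared__ float temp[];             // n floats of dynamic shared memory
        int tid = threadIdx.x;

        temp[2 * tid]     = data[2 * tid];          // load two elements per thread
        temp[2 * tid + 1] = data[2 * tid + 1];

        int offset = 1;
        for (int d = n >> 1; d > 0; d >>= 1) {      // upsweep (reduce) phase
            __syncthreads();
            if (tid < d) {
                int ai = offset * (2 * tid + 1) - 1;
                int bi = offset * (2 * tid + 2) - 1;
                temp[bi] += temp[ai];
            }
            offset *= 2;
        }

        if (tid == 0) temp[n - 1] = 0;              // clear the root

        for (int d = 1; d < n; d *= 2) {            // downsweep phase
            offset >>= 1;
            __syncthreads();
            if (tid < d) {
                int ai = offset * (2 * tid + 1) - 1;
                int bi = offset * (2 * tid + 2) - 1;
                float t  = temp[ai];
                temp[ai] = temp[bi];
                temp[bi] += t;
            }
        }
        __syncthreads();
        data[2 * tid]     = temp[2 * tid];          // write results back
        data[2 * tid + 1] = temp[2 * tid + 1];
    }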

GPU Scan Implementations:
Very large arrays: store the array in global memory; synchronize using multiple kernel calls.
Arrays that fit in shared memory: store the array in shared memory; synchronize using __syncthreads().
Arrays smaller than the warp size: use warp collective instructions (see the sketch below).
Or simply use the NVIDIA CUB library!
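
For the "smaller than warp size" case, a minimal sketch (not from the slides) of an inclusive scan built from warp collective instructions; it assumes a full warp of 32 active threads and the CUDA 9+ __shfl_up_sync intrinsic.

    __device__ int warp_inclusive_scan(int x) {
        const unsigned FULL_MASK = 0xffffffffu;
        int lane = threadIdx.x & 31;                      // lane index within the warp
        for (int offset = 1; offset < 32; offset *= 2) {
            int y = __shfl_up_sync(FULL_MASK, x, offset); // value from 'offset' lanes below
            if (lane >= offset)                           // lower lanes keep their value
                x += y;
        }
        return x;                                         // lane i now holds the sum of lanes 0..i
    }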

Using Scan to reduce the cost of atomics
Assume you have a worklist which is implemented as follows:

    __device__ int worklist[1024];
    __device__ int tail = 0;

    __device__ void push_parallel(int item) {
        int old_tail = atomicAdd(&tail, 1);
        worklist[old_tail] = item;
    }

Client Code: 1 Thread

    for (int i = 0; i < n; i++)
        push_parallel(work[i]);

Client Code Optimized: 1 Thread

    int old_tail = atomicAdd(&tail, n);
    for (int i = 0; i < n; i++)
        worklist[old_tail + i] = work[i];

Client Code: Adding more threads
T threads, each pushing n items; n is the same for every thread.

    __shared__ int old_tail;
    if (tid == 0)
        old_tail = atomicAdd(&tail, n * T);
    __syncthreads();
    for (int i = 0; i < n; i++)
        worklist[old_tail + n * tid + i] = work[i];

Client Code: General Problem
T threads, each pushing n items; n may be different for each thread.

    __shared__ int old_tail;
    if (tid == 0)
        old_tail = atomicAdd(&tail, ?);
    __syncthreads();
    for (int i = 0; i < n; i++)
        worklist[old_tail + ? + i] = work[i];

Client Code: Solution
T threads, each pushing n items; n may be different for each thread.

    __shared__ int old_tail;
    int offset, total;
    ExclusiveSum(n, total, offset);
    if (tid == 0)
        old_tail = atomicAdd(&tail, total);
    __syncthreads();
    for (int i = 0; i < n; i++)
        worklist[old_tail + offset + i] = work[i];

            T0   T1   T2   T3   T4
    n        1    0    3    5    1
    offset   0    1    1    4    9
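
The ExclusiveSum above is pseudocode; below is a minimal sketch (an assumption, not the lecture's code) of the same pattern using CUB's BlockScan, with 128 threads per block, one item count per thread, and dummy item values.

    #include <cub/cub.cuh>

    __device__ int worklist[1024];
    __device__ int tail;

    __global__ void push_all(const int *counts) {
        typedef cub::BlockScan<int, 128> BlockScan;       // block of exactly 128 threads
        __shared__ typename BlockScan::TempStorage temp;
        __shared__ int old_tail;

        int n = counts[threadIdx.x];                      // items this thread will push
        int offset, total;
        BlockScan(temp).ExclusiveSum(n, offset, total);   // per-thread offset, block-wide total

        if (threadIdx.x == 0)
            old_tail = atomicAdd(&tail, total);           // one atomic for the whole block
        __syncthreads();

        for (int i = 0; i < n; i++)
            worklist[old_tail + offset + i] = threadIdx.x;  // dummy item value
    }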

Performance

The Many Uses of Scan: Stream compaction/filtering, for when you want to filter elements of one array into another (sketched below). Radix sort. Many more...
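
To make the stream-compaction use concrete, here is a serial sketch of the idea (an illustration, not from the slides; on the GPU each step becomes a parallel kernel or a library call, and keep() is a placeholder predicate).

    #include <stdlib.h>

    static int keep(int x) { return x != 0; }    /* placeholder predicate */

    int compact(const int *A, int *out, int N) {
        int *flag   = (int *)malloc(N * sizeof(int));
        int *offset = (int *)malloc(N * sizeof(int));

        for (int i = 0; i < N; i++)              /* 1. flag the elements to keep */
            flag[i] = keep(A[i]) ? 1 : 0;

        offset[0] = 0;                           /* 2. exclusive scan of the flags gives */
        for (int i = 1; i < N; i++)              /*    each kept element its output slot */
            offset[i] = offset[i - 1] + flag[i - 1];

        for (int i = 0; i < N; i++)              /* 3. scatter the kept elements */
            if (flag[i])
                out[offset[i]] = A[i];

        int kept = offset[N - 1] + flag[N - 1];  /* number of elements written */
        free(flag); free(offset);
        return kept;
    }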

Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work

Scalar Product: Problem: given n pairs of vectors, all w elements wide, compute the scalar products of all the pairs. Multiplications: n * w. Additions: n * w. How shall we distribute the work?

Distribution 1: Assign a pair to a thread block; a thread block executes on a single SM (see the sketch below). What happens if the number of pairs is less than the number of SMs?
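
A minimal sketch (not from the slides) of Distribution 1: each thread block computes the scalar product of one pair using a shared-memory tree reduction. Assumes blockDim.x is a power of two and a launch of the form scalar_products<<<n, threads, threads * sizeof(float)>>>(a, b, out, w).

    __global__ void scalar_products(const float *a, const float *b, float *out, int w) {
        extern __shared__ float partial[];         // one slot per thread
        int pair = blockIdx.x;                     // the vector pair this block handles
        const float *va = a + (size_t)pair * w;
        const float *vb = b + (size_t)pair * w;

        float sum = 0.0f;
        for (int i = threadIdx.x; i < w; i += blockDim.x)
            sum += va[i] * vb[i];                  // each thread accumulates a strided slice
        partial[threadIdx.x] = sum;
        __syncthreads();

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction in shared memory
            if (threadIdx.x < s)
                partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            out[pair] = partial[0];
    }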

Distribution 2: Divide the vectors into parts and assign the parts to thread blocks; all thread blocks handle one pair at a time. What happens if the width of the vectors is less than the number of SMs?

Input size sensitivity Samadi et al. Adaptive Input-aware Compilation for Graphics Engines, PLDI 12.

Solution: There is enough work to saturate the GPU (wide pairs in the first case, lots of pairs in the second case); it is just not distributed evenly. Write both versions and choose between the two at runtime depending on input size. See MonteCarlo in the CUDA SDK (4.2) for an example.

Conclusion: GPU architecture has different tradeoffs: occupancy and divergence. GPU costs are different from CPU costs. GPU programs can take advantage of several parallel programming primitives (scan, in particular) and of novel ways to reduce costs using such collective operations. GPU utilization may require multiple schedules; we did not cover dynamic scheduling.