Optimization Strategies Global Memory Access Pattern and Control Flow

Global Memory Access Pattern and Control Flow

Optimization Strategies: Global Memory Access Pattern (Coalescing), Control Flow (Divergent Branch)

Global memory accesses are the highest-latency instructions: 400-600 clock cycles. They are likely to be the performance bottleneck, and optimizations can greatly increase performance. The best access pattern is coalescing, which gives up to a 10x speedup.

Coalescing is a coordinated read by a half-warp (16 threads) of a contiguous region of global memory:
- 64 bytes - each thread reads a word: int, float, ...
- 128 bytes - each thread reads a double-word: int2, float2, ...
- 256 bytes - each thread reads a quad-word: int4, float4, ...
Additional restrictions on the G8x architecture:
- The starting address of the region must be a multiple of the region size
- The k-th thread in a half-warp must access the k-th element in the block being read
- Exception: not all threads must participate (predicated access, divergence within a half-warp)

Examples of coalesced access: all threads participate; some threads do not participate.

Examples of non-coalesced access: permuted access by threads; misaligned starting address (not a multiple of 64).
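As a minimal sketch (the kernel name copyCoalesced is hypothetical), a copy kernel in which thread k of each half-warp reads word k of an aligned 64-byte segment satisfies the rules above, while shifting the index by one breaks the alignment requirement:

__global__ void copyCoalesced(float *d_in, float *d_out)
{
    // Thread k of each half-warp reads the k-th 4-byte word of a 64-byte
    // segment; the segment start is a multiple of 64 as long as d_in is
    // 64-byte aligned and blockDim.x is a multiple of 16.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_out[i] = d_in[i];

    // Not coalesced on G8x (misaligned starting address):
    // d_out[i] = d_in[i + 1];
}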

__global__ void accessFloat3(float3 *d_in, float3 *d_out)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    float3 a = d_in[index];
    a.x += 2;
    a.y += 2;
    a.z += 2;
    d_out[index] = a;
}

float3 is 12 bytes, so each thread ends up executing 3 reads: sizeof(float3) is not 4, 8, or 16, and the accesses cannot be coalesced. The half-warp reads three 64B non-contiguous regions.

(Figure: the block's float3 data is loaded in three coalesced steps; similarly, step 3 starts at offset 512.)

Use shared memory to allow coalescing (see the sketch below):
- Need sizeof(float3) * (threads/block) bytes of SMEM
- Each thread reads 3 scalar floats at offsets 0, (threads/block), and 2*(threads/block)
- These floats will likely be processed by other threads, so synchronize
Processing:
- Each thread retrieves its float3 from the SMEM array
- Cast the SMEM pointer to (float3*) and use the thread ID as the index
- The rest of the compute code does not change!
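A sketch of the shared-memory version, assuming 256 threads per block and treating the input and output as flat float arrays (the kernel name accessFloat3Shared is illustrative):

__global__ void accessFloat3Shared(float *g_in, float *g_out)
{
    // Each block handles blockDim.x float3 values = 3 * blockDim.x floats
    __shared__ float s_data[256 * 3];              // assumes 256 threads/block
    int index = 3 * blockIdx.x * blockDim.x + threadIdx.x;

    // Three coalesced reads at offsets 0, 256, and 512
    s_data[threadIdx.x]       = g_in[index];
    s_data[threadIdx.x + 256] = g_in[index + 256];
    s_data[threadIdx.x + 512] = g_in[index + 512];
    __syncthreads();

    // Each thread now fetches its own float3 from SMEM and computes as before
    float3 a = ((float3 *)s_data)[threadIdx.x];
    a.x += 2;
    a.y += 2;
    a.z += 2;
    ((float3 *)s_data)[threadIdx.x] = a;
    __syncthreads();

    // Three coalesced writes back to global memory
    g_out[index]       = s_data[threadIdx.x];
    g_out[index + 256] = s_data[threadIdx.x + 256];
    g_out[index + 512] = s_data[threadIdx.x + 512];
}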

Use a structure of arrays (SoA) instead of an array of structures (AoS). If converting to SoA is not viable: force structure alignment with __align__(X), where X = 4, 8, or 16, or use SMEM to achieve coalescing.
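As an illustrative sketch (the struct names are hypothetical), the same point data can be padded to an aligned size or laid out as a structure of arrays:

// AoS with 12-byte elements: reads are not coalesced on G8x
struct PointAoS { float x, y, z; };

// Forcing 16-byte alignment pads each element so a single aligned
// 16-byte access per thread can be coalesced (at the cost of 4 bytes)
struct __align__(16) PointAoSAligned { float x, y, z; };

// SoA: each component is a contiguous array, so consecutive threads
// read consecutive words and accesses coalesce naturally
struct PointSoA {
    float *x;
    float *y;
    float *z;
};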

The main performance concern with branching is divergence: threads within a single warp take different paths, and the different execution paths are serialized. The control paths taken by the threads in a warp are traversed one at a time until there are no more.

A common case: avoid divergence when the branch condition is a function of the thread ID.
Example with divergence: if (threadIdx.x > 2) { }. This creates two different control paths for threads in a block. The branch granularity is less than the warp size: threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp.
Example without divergence: if (threadIdx.x / WARP_SIZE > 2) { }. This also creates two different control paths for threads in a block, but the branch granularity is a whole multiple of the warp size: all threads in any given warp follow the same path.
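A minimal sketch contrasting the two cases, assuming WARP_SIZE = 32 and a hypothetical kernel name:

#define WARP_SIZE 32

__global__ void branchExample(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: granularity < warp size, so the first warp serializes
    // the path of threads 0-2 and the path of threads 3-31
    if (threadIdx.x > 2)
        data[i] *= 2.0f;

    // Non-divergent: the condition is constant across each warp,
    // so every thread in a given warp takes the same path
    if (threadIdx.x / WARP_SIZE > 2)
        data[i] += 1.0f;
}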

Given an array of values, "reduce" them to a single value in parallel. Examples: sum reduction (the sum of all values in the array) and max reduction (the maximum of all values in the array). Typical parallel implementation: recursively halve the number of threads, adding two values per thread; this takes log(n) steps for n elements and requires n/2 threads.

Assume an in-place reduction using shared memory. The original vector is in device global memory, and shared memory is used to hold a partial-sum vector. Each iteration brings the partial-sum vector closer to the final sum; the final solution will be in element 0.
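A sketch of this simple in-place kernel, assuming 512 threads per block, an input length that is a multiple of the block size, and illustrative names (reduceSimple, partialSum):

__global__ void reduceSimple(float *g_in, float *g_out)
{
    __shared__ float partialSum[512];        // assumes 512 threads per block
    unsigned int t = threadIdx.x;
    partialSum[t] = g_in[blockIdx.x * blockDim.x + t];

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();                     // wait for the previous iteration
        if (t % (2 * stride) == 0)           // divergent branch (see below)
            partialSum[t] += partialSum[t + stride];
    }

    if (t == 0)                              // element 0 holds the block's sum
        g_out[blockIdx.x] = partialSum[0];
}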

(Figure: reduction tree over array elements 0..15. Iteration 1 produces the pairwise sums 0+1, 2+3, 4+5, 6+7, 8+9, 10+11, ...; iteration 2 produces partial sums 0..3, 4..7, 8..11, ...; iteration 3 produces 0..7 and 8..15.)

(Figure: in the simple kernel, the positions active in successive iterations, besides position 0, are 2, 4, 6, 8, 10, 12, 14; then 4, 8, 12; then 8.)

In each iteration, two control flow paths are sequentially traversed for each warp: the threads that perform the addition and the threads that do not. The threads that do not perform the addition may still cost extra cycles, depending on the implementation of divergence.

No more than half of the threads will be executing at any time; all odd-index threads are disabled right from the beginning! On average, fewer than 1/4 of the threads will be active across all warps over time. After the 5th iteration, entire warps in each block will be disabled: poor resource utilization, but no divergence. This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), where each iteration has only one thread active per warp, until all warps retire.

Replace the divergent branch with a strided index and a non-divergent branch.
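A sketch of the revised loop, replacing the loop in the reduceSimple sketch above (same assumptions, same partialSum array):

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        // Active threads are the contiguous range 0 .. blockDim.x/(2*stride) - 1,
        // so whole warps are either fully active or fully inactive
        unsigned int index = 2 * stride * t;
        if (index < blockDim.x)
            partialSum[index] += partialSum[index + stride];
    }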

There is no divergence until there are fewer than 16 sub-sums.

However, bank conflicts arise due to the strided addressing.

Replace strided indexing with a reversed loop and threadIdx-based indexing.
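A sketch of the final loop, again replacing the loop in the reduceSimple sketch above:

    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        __syncthreads();
        // The first `stride` threads do the work; their accesses are
        // contiguous, so there are no shared memory bank conflicts
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }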

Only the last 5 iterations will have divergence. Entire warps will be shut down as the iterations progress: for a 512-thread block, it takes 4 iterations to shut down all but one warp in each block. This gives better resource utilization and will likely retire warps, and thus blocks, faster. Recall that there are no bank conflicts either.