Global Memory Access Pattern and Control Flow

Optimization Strategies Global Memory Access Pattern and Control Flow

Objectives: Optimization strategies for the global memory access pattern (coalescing) and for control flow (divergent branches).

Global Memory Access: Highest latency instructions: 400-600 clock cycles. Likely to be the performance bottleneck. Optimizations can greatly increase performance. Best access pattern: coalescing, up to 10x speedup.

Coalesced Memory Access: A coordinated read by a half-warp (16 threads) of a contiguous region of global memory: 64 bytes when each thread reads a word (int, float, ...), 128 bytes when each thread reads a double-word (int2, float2, ...), 256 bytes when each thread reads a quad-word (int4, float4, ...). Additional restrictions on the G8X architecture: the starting address of the region must be a multiple of the region size, and the k-th thread in a half-warp must access the k-th element in the block being read. Exception: not all threads need to participate (predicated access or divergence within a half-warp).

Coalesced Access: Reading floats (figure shows two coalesced cases: all threads participate, and some threads do not participate).

Non-Coalesced Access: Reading floats (figure shows two non-coalesced cases: permuted access by threads, and a misaligned starting address that is not a multiple of 64).

Example: Non-coalesced float3 read

    __global__ void accessFloat3(float3 *d_in, float3 *d_out)
    {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        float3 a = d_in[index];
        a.x += 2;
        a.y += 2;
        a.z += 2;
        d_out[index] = a;
    }

Example: Non-coalesced float3 read (cont.): float3 is 12 bytes, so sizeof(float3) is not 4, 8, or 16. Each thread ends up executing 3 reads, and a half-warp reads three 64B non-contiguous regions.

Example: Non-coalesced float3 read (2): Similarly, step 3 starts at offset 512.

Example: Non-coalesced float3 read (3): Use shared memory to allow coalescing. Need sizeof(float3) * (threads/block) bytes of SMEM. Each thread reads 3 scalar floats at offsets 0, (threads/block), and 2*(threads/block); these will likely be processed by other threads, so synchronize. Processing: each thread retrieves its float3 from the SMEM array by casting the SMEM pointer to (float3*) and using its thread ID as the index. The rest of the compute code does not change!

Example: Final Coalesced Code
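The code on this slide was not transcribed. Below is a sketch of what the final coalesced version can look like, assuming 256 threads per block; the kernel name accessFloat3Shared and parameter names are placeholders. Each thread reads and writes 3 scalar floats so that every half-warp touches contiguous, aligned 64-byte regions.

    __global__ void accessFloat3Shared(float *g_in, float *g_out)
    {
        int index = 3 * blockIdx.x * blockDim.x + threadIdx.x;
        __shared__ float s_data[256 * 3];

        // Coalesced reads: consecutive threads read consecutive floats
        s_data[threadIdx.x]       = g_in[index];
        s_data[threadIdx.x + 256] = g_in[index + 256];
        s_data[threadIdx.x + 512] = g_in[index + 512];
        __syncthreads();

        // Each thread processes its own float3 out of shared memory
        float3 a = ((float3 *)s_data)[threadIdx.x];
        a.x += 2;
        a.y += 2;
        a.z += 2;
        ((float3 *)s_data)[threadIdx.x] = a;
        __syncthreads();

        // Coalesced writes back to global memory
        g_out[index]       = s_data[threadIdx.x];
        g_out[index + 256] = s_data[threadIdx.x + 256];
        g_out[index + 512] = s_data[threadIdx.x + 512];
    }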

Coalescing: Structures of Size ≠ 4, 8, or 16 Bytes: Use a structure of arrays (SoA) instead of an array of structures (AoS). If SoA is not viable: force structure alignment with __align__(X), where X = 4, 8, or 16, or use SMEM to achieve coalescing.
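A small illustration of the two alternatives (the struct names here are hypothetical, not from the slides): a structure of arrays keeps each field contiguous so per-field accesses coalesce, while __align__(X) pads an array-of-structures element up to a coalescable size.

    // Structure of Arrays: fields stored contiguously, so accesses to
    // one field by consecutive threads are coalesced.
    struct PointsSoA {
        float *x;
        float *y;
        float *z;
    };

    // Array of Structures with forced 16-byte alignment: each 12-byte
    // element is padded to 16 bytes, a size the hardware can coalesce.
    struct __align__(16) PointAoS {
        float x, y, z;
    };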

Control Flow Instructions in GPUs: The main performance concern with branching is divergence: threads within a single warp take different paths. The different execution paths are serialized; the control paths taken by the threads in a warp are traversed one at a time until there are no more.

Divergent Branch: A common case: avoid divergence when the branch condition is a function of the thread ID. Example with divergence: if (threadIdx.x > 2) { }. This creates two different control paths for threads in a block with branch granularity < warp size; threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp. Example without divergence: if (threadIdx.x / WARP_SIZE > 2) { }. This also creates two different control paths for threads in a block, but the branch granularity is a whole multiple of the warp size; all threads in any given warp follow the same path.
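A minimal sketch of the two cases above (the kernel name and the WARP_SIZE constant are illustrative assumptions):

    #define WARP_SIZE 32

    __global__ void branchExample(float *data)
    {
        // Divergent: threads 0, 1, and 2 of the first warp take a
        // different path than the other threads of that warp.
        if (threadIdx.x > 2) {
            data[threadIdx.x] *= 2.0f;
        }

        // Non-divergent: the condition is uniform within each warp, so
        // all 32 threads of any given warp follow the same path.
        if (threadIdx.x / WARP_SIZE > 2) {
            data[threadIdx.x] += 1.0f;
        }
    }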

Parallel Reduction: Given an array of values, reduce them to a single value in parallel. Examples: sum reduction (sum of all values in the array) and max reduction (maximum of all values in the array). Typical parallel implementation: recursively halve the number of threads, adding two values per thread. Takes log(n) steps for n elements and requires n/2 threads.

A Vector Reduction Example: Assume an in-place reduction using shared memory. The original vector is in device global memory, and shared memory is used to hold a partial sum vector. Each iteration brings the partial sum vector closer to the final sum. The final solution will be in element 0.

Vector Reduction (diagram): array elements 0..11; after iteration 1 the partial sums are 0+1, 2+3, 4+5, 6+7, 8+9, 10+11; after iteration 2 they cover 0..3, 4..7, 8..11; after iteration 3 they cover 0..7, 8..15.

A simple implementation
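The code on this slide was not transcribed. Below is a sketch of the simple interleaved implementation described above, assuming 512 threads per block and one partial sum written out per block; names such as reduceSimple, g_idata, and g_odata are placeholders.

    #define BLOCK_SIZE 512

    __global__ void reduceSimple(const float *g_idata, float *g_odata)
    {
        __shared__ float partialSum[BLOCK_SIZE];
        unsigned int t = threadIdx.x;

        // Load one element per thread into shared memory
        partialSum[t] = g_idata[blockIdx.x * blockDim.x + t];

        // In-place reduction: recursively halve the number of adders
        for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
            __syncthreads();
            // Divergent branch: only threads whose index is a multiple
            // of 2*stride perform an addition in this iteration
            if (t % (2 * stride) == 0)
                partialSum[t] += partialSum[t + stride];
        }

        // The block's result ends up in element 0
        if (t == 0)
            g_odata[blockIdx.x] = partialSum[0];
    }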

Interleaved Reduction (diagram of the strided reduction tree; element indices shown at successive levels: 2, 4, 6, 8, 10, 12, 14, then 4, 8, 12, then 8).

Some Observations: In each iteration, two control flow paths will be sequentially traversed for each warp: threads that perform the addition and threads that do not. Threads that do not perform the addition may still cost extra cycles, depending on the implementation of divergence.

Some Observations (cont.): No more than half of the threads will be executing at any time; all odd-index threads are disabled right from the beginning! On average, fewer than ¼ of the threads will be active across all warps over time. After the 5th iteration, entire warps in each block will be disabled: poor resource utilization but no divergence. This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), where each iteration has only one thread activated, until all warps retire.

Optimization 1: Replace the divergent branch with a strided index and a non-divergent branch.
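A sketch of the replacement loop body (using the same shared array partialSum and thread index t as in the simple kernel above):

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        // Threads with consecutive values of t do the work, so every
        // warp is either fully active or fully idle (no divergence)
        // until fewer additions than a warp's width remain
        unsigned int index = 2 * stride * t;
        if (index < blockDim.x)
            partialSum[index] += partialSum[index + stride];
    }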

Optimization 1 (cont.): No divergence until fewer than 16 partial sums remain.

Optimization 1: Bank Conflict Issue: The strided addressing causes shared memory bank conflicts.

Optimization 2: Sequential Addressing

Optimization 2 (cont.): Replace strided indexing with a reversed loop and threadIdx-based indexing.
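A sketch of that reversed loop (again using partialSum and t from the earlier kernel); active threads now access consecutive shared memory locations, which also avoids the bank conflicts noted above:

    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        __syncthreads();
        // Sequential addressing: the first 'stride' threads add the
        // upper half of the remaining partial sums onto the lower half
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }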

Some Observations About the New Implementation: Only the last 5 iterations will have divergence. Entire warps will be shut down as the iterations progress: for a 512-thread block, it takes 4 iterations to shut down all but one warp in each block. Better resource utilization; warps, and thus blocks, will likely retire faster. Recall that there are no bank conflicts either.