
Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology

Parallel Reduction. A common and important data-parallel primitive. Easy to implement in CUDA, harder to get it right. It serves as a great optimization example: we'll walk step by step through different versions, demonstrating several important optimization strategies.

Parallel Reduction. Tree-based approach used within each thread block:

    3   1   7   0   4   1   6   3
      4       7       5       9
         11             14
                25

We need to be able to use multiple thread blocks: to process very large arrays, and to keep all multiprocessors on the GPU busy. Each thread block reduces a portion of the array. But how do we communicate partial results between thread blocks?

Problem: Global Synchronization. If we could synchronize across all thread blocks, we could easily reduce very large arrays: global sync after each block produces its result, and once all blocks reach the sync, continue recursively. But CUDA has no global synchronization. Why? It is expensive to build in hardware for GPUs with high processor counts, and it would force the programmer to run fewer blocks (no more than # multiprocessors * # resident blocks / multiprocessor) to avoid deadlock, which may reduce overall efficiency. Solution: decompose into multiple kernels. A kernel launch serves as a global synchronization point, and it has negligible HW overhead and low SW overhead.

Solution: Kernel Decomposition. Avoid global sync by decomposing the computation into multiple kernel invocations.

[Figure: eight thread blocks each perform the tree-based reduction (3 1 7 0 4 1 6 3 -> 25) at Level 0; a single block at Level 1 reduces the eight per-block partial results.]

Level 0: 8 blocks. Level 1: 1 block. In the case of reductions, the code for all levels is the same. Recursive kernel invocation.
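To make "recursive kernel invocation" concrete, here is a minimal host-side sketch (not from the original slides; the kernel name reduce0, the buffer swap, and the 128-thread block size are illustrative assumptions):

    // Hypothetical driver loop: each kernel launch is an implicit global sync point.
    // Assumes reduce0(int *g_idata, int *g_odata) writes one partial sum per block.
    void reduceLevels(int *d_in, int *d_out, int n)
    {
        const int threads = 128;                      // assumed block size
        while (n > 1) {
            int blocks = (n + threads - 1) / threads; // one output element per block
            reduce0<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out);
            n = blocks;                               // next level reduces the partial sums
            int *tmp = d_in; d_in = d_out; d_out = tmp;
        }
        // After the final swap, d_in[0] holds the reduced value.
    }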

What is Our Optimization Goal? We should strive to reach GPU peak performance. Choose the right metric: GFLOP/s for compute-bound kernels, bandwidth for memory-bound kernels. Reductions have very low arithmetic intensity: 1 flop per element loaded (bandwidth-optimal). Therefore we should strive for peak bandwidth. We will use a G80 GPU for this example: 384-bit memory interface, 900 MHz DDR. 384 * 1800 / 8 = 86.4 GB/s.
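To spell out the arithmetic (a small sketch, not part of the slides), both the theoretical peak and a kernel's effective bandwidth follow from simple formulas:

    // Theoretical peak: bus width in bytes times effective DDR transfer rate.
    // 384-bit bus at 900 MHz DDR: (384 / 8) bytes * 1.8e9 transfers/s = 86.4 GB/s.

    // Effective bandwidth of a reduction: each of the N input ints is read once.
    // Example: 4M ints in 8.054 ms gives roughly 2.08 GB/s (kernel 1 below).
    double effective_bandwidth_gbs(unsigned int n, double time_ms)
    {
        return (double)n * sizeof(int) / (time_ms * 1e-3) / 1e9;
    }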

Reduction #1: Interleaved Addressing

    __global__ void reduce0(int *g_idata, int *g_odata) {
        extern __shared__ int sdata[];

        // each thread loads one element from global to shared mem
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
        sdata[tid] = g_idata[i];
        __syncthreads();

        // do reduction in shared mem
        for (unsigned int s=1; s < blockDim.x; s *= 2) {
            if (tid % (2*s) == 0) {
                sdata[tid] += sdata[tid + s];
            }
            __syncthreads();
        }

        // write result for this block to global mem
        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }

Parallel Reduction: Interleaved Addressing

Values (shared memory):  10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2

Step 1, Stride 1 (Thread IDs 0 2 4 6 8 10 12 14):
Values: 11  1  7 -1 -2 -2  8  5 -5 -3  9  7 11 11  2  2

Step 2, Stride 2 (Thread IDs 0 4 8 12):
Values: 18  1  7 -1  6 -2  8  5  4 -3  9  7 13 11  2  2

Step 3, Stride 4 (Thread IDs 0 8):
Values: 24  1  7 -1  6 -2  8  5 17 -3  9  7 13 11  2  2

Step 4, Stride 8 (Thread IDs 0):
Values: 41  1  7 -1  6 -2  8  5 17 -3  9  7 13 11  2  2

Reduction #1: Interleaved Addressing

    __global__ void reduce0(int *g_idata, int *g_odata) {
        extern __shared__ int sdata[];

        // each thread loads one element from global to shared mem
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
        sdata[tid] = g_idata[i];
        __syncthreads();

        // do reduction in shared mem
        for (unsigned int s=1; s < blockDim.x; s *= 2) {
            if (tid % (2*s) == 0) {
                sdata[tid] += sdata[tid + s];
            }
            __syncthreads();
        }

        // write result for this block to global mem
        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }

Problem: the highly divergent branching results in very poor performance!

Performance for 4M element reduction (times are for 2^22 ints):

Kernel 1 (interleaved addressing with divergent branching): 8.054 ms, 2.083 GB/s

Note: block size = 128 threads for all tests.

Reduction #2: Interleaved Addressing. Just replace the divergent branch in the inner loop:

    for (unsigned int s=1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

with a strided index and non-divergent branch:

    for (unsigned int s=1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }

Parallel Reduction: Interleaved Addressing

Values (shared memory):  10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2

Step 1, Stride 1 (Thread IDs 0 1 2 3 4 5 6 7):
Values: 11  1  7 -1 -2 -2  8  5 -5 -3  9  7 11 11  2  2

Step 2, Stride 2 (Thread IDs 0 1 2 3):
Values: 18  1  7 -1  6 -2  8  5  4 -3  9  7 13 11  2  2

Step 3, Stride 4 (Thread IDs 0 1):
Values: 24  1  7 -1  6 -2  8  5 17 -3  9  7 13 11  2  2

Step 4, Stride 8 (Thread IDs 0):
Values: 41  1  7 -1  6 -2  8  5 17 -3  9  7 13 11  2  2

New problem: shared memory bank conflicts.

Performance for 4M element reduction:

Kernel 1 (interleaved addressing with divergent branching): 8.054 ms, 2.083 GB/s
Kernel 2 (interleaved addressing with bank conflicts): 3.456 ms, 4.854 GB/s (2.33x step, 2.33x cumulative)

Parallel Reduction: Sequential Addressing

Values (shared memory):  10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2

Step 1, Stride 8 (Thread IDs 0 1 2 3 4 5 6 7):
Values:  8 -2 10  6  0  9  3  7 -2 -3  2  7  0 11  0  2

Step 2, Stride 4 (Thread IDs 0 1 2 3):
Values:  8  7 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

Step 3, Stride 2 (Thread IDs 0 1):
Values: 21 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

Step 4, Stride 1 (Thread IDs 0):
Values: 41 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

Sequential addressing is conflict free.

Reduction #3: Sequential Addressing. Just replace the strided indexing in the inner loop:

    for (unsigned int s=1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }

with a reversed loop and threadID-based indexing:

    for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

Performance for 4M element reduction:

Kernel 1 (interleaved addressing with divergent branching): 8.054 ms, 2.083 GB/s
Kernel 2 (interleaved addressing with bank conflicts): 3.456 ms, 4.854 GB/s (2.33x step, 2.33x cumulative)
Kernel 3 (sequential addressing): 1.722 ms, 9.741 GB/s (2.01x step, 4.68x cumulative)

Idle Threads. Problem:

    for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

Half of the threads are idle on the first loop iteration! This is wasteful.

Reduction #4: First Add During Load. Halve the number of blocks, and replace the single load:

    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

with two loads and the first add of the reduction:

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
    __syncthreads();

Performance for 4M element reduction:

Kernel 1 (interleaved addressing with divergent branching): 8.054 ms, 2.083 GB/s
Kernel 2 (interleaved addressing with bank conflicts): 3.456 ms, 4.854 GB/s (2.33x step, 2.33x cumulative)
Kernel 3 (sequential addressing): 1.722 ms, 9.741 GB/s (2.01x step, 4.68x cumulative)
Kernel 4 (first add during global load): 0.965 ms, 17.377 GB/s (1.78x step, 8.34x cumulative)

Instruction Bottleneck. At 17 GB/s, we're far from bandwidth bound, and we know reduction has low arithmetic intensity. Therefore a likely bottleneck is instruction overhead: ancillary instructions that are not loads, stores, or arithmetic for the core computation. In other words: address arithmetic and loop overhead. Strategy: unroll loops.

Unrolling the Last Warp. As the reduction proceeds, the number of active threads decreases. When s <= 32, we have only one warp left. Instructions are SIMD synchronous within a warp. That means when s <= 32: we don't need to __syncthreads(), and we don't need if (tid < s) because it doesn't save any work. Let's unroll the last 6 iterations of the inner loop.

Reduction #5: Unroll the Last Warp

    for (unsigned int s=blockDim.x/2; s>32; s>>=1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid < 32) {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }

Note: this saves useless work in all warps, not just the last one! Without unrolling, all warps execute every iteration of the for loop and the if statement.
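One caveat worth flagging (an addition beyond the original slides): this warp-synchronous code relies on the compiler not caching sdata values in registers between the unrolled statements. The CUDA SDK version of this step factors it into a device function that takes a volatile shared-memory pointer; a minimal sketch under that assumption:

    __device__ void warpReduce(volatile int *sdata, unsigned int tid) {
        // volatile forces each read/write to go through shared memory,
        // preserving the implicit warp-synchronous ordering of the adds.
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }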

Performance for 4M element reduction:

Kernel 1 (interleaved addressing with divergent branching): 8.054 ms, 2.083 GB/s
Kernel 2 (interleaved addressing with bank conflicts): 3.456 ms, 4.854 GB/s (2.33x step, 2.33x cumulative)
Kernel 3 (sequential addressing): 1.722 ms, 9.741 GB/s (2.01x step, 4.68x cumulative)
Kernel 4 (first add during global load): 0.965 ms, 17.377 GB/s (1.78x step, 8.34x cumulative)
Kernel 5 (unroll last warp): 0.536 ms, 31.289 GB/s (1.8x step, 15.01x cumulative)

Complete Unrolling. If we knew the number of iterations at compile time, we could completely unroll the reduction. Luckily, the block size is limited by the GPU to 512 threads, and we are sticking to power-of-2 block sizes, so we can easily unroll for a fixed block size. But we need to be generic: how can we unroll for block sizes that we don't know at compile time? Templates to the rescue! CUDA supports C++ template parameters on device and host functions.

Unrolling with Templates. Specify the block size as a function template parameter:

    template <unsigned int blockSize>
    __global__ void reduce5(int *g_idata, int *g_odata)

Reduction #6: Completely Unrolled

    if (blockSize >= 512) {
        if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads();
    }
    if (blockSize >= 256) {
        if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads();
    }
    if (blockSize >= 128) {
        if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads();
    }

    if (tid < 32) {
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
        if (blockSize >= 8)  sdata[tid] += sdata[tid + 4];
        if (blockSize >= 4)  sdata[tid] += sdata[tid + 2];
        if (blockSize >= 2)  sdata[tid] += sdata[tid + 1];
    }

Note: all of the comparisons against blockSize (shown in red on the original slide) are evaluated at compile time. This results in a very efficient inner loop!

Invoking Template Kernels. Don't we still need the block size at compile time? Nope, just a switch statement over the 10 possible block sizes:

    switch (threads) {
        case 512: reduce5<512><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
        case 256: reduce5<256><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
        case 128: reduce5<128><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
        case  64: reduce5< 64><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
        case  32: reduce5< 32><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
        case  16: reduce5< 16><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
        case   8: reduce5<  8><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
        case   4: reduce5<  4><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
        case   2: reduce5<  2><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
        case   1: reduce5<  1><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
    }

Performance for 4M element reduction:

Kernel 1 (interleaved addressing with divergent branching): 8.054 ms, 2.083 GB/s
Kernel 2 (interleaved addressing with bank conflicts): 3.456 ms, 4.854 GB/s (2.33x step, 2.33x cumulative)
Kernel 3 (sequential addressing): 1.722 ms, 9.741 GB/s (2.01x step, 4.68x cumulative)
Kernel 4 (first add during global load): 0.965 ms, 17.377 GB/s (1.78x step, 8.34x cumulative)
Kernel 5 (unroll last warp): 0.536 ms, 31.289 GB/s (1.8x step, 15.01x cumulative)
Kernel 6 (completely unrolled): 0.381 ms, 43.996 GB/s (1.41x step, 21.16x cumulative)

Parallel Reduction Complexity. log(N) parallel steps, and each step s performs N/2^s independent operations. Step complexity is O(log N). For N = 2^D, the reduction performs sum over s in [1..D] of 2^(D-s) = N - 1 operations, so the work complexity is O(N). It is work-efficient, i.e. it does not perform more operations than a sequential algorithm. With P threads physically in parallel (P processors), the time complexity is O(N/P + log N). Compare to O(N) for a sequential reduction. In a thread block, N = P, so O(log N).
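The operation count can be verified with a geometric series (standard algebra, restating the slide's claim in LaTeX):

    \sum_{s=1}^{D} 2^{D-s} = 2^{D-1} + 2^{D-2} + \cdots + 2^{0} = 2^{D} - 1 = N - 1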

What About Cost? The cost of a parallel algorithm is the number of processors times the time complexity. If we allocate threads instead of processors, we use O(N) threads; the time complexity is O(log N), so the cost is O(N log N): not cost efficient! Brent's theorem suggests O(N/log N) threads: each thread does O(log N) sequential work, then all O(N/log N) threads cooperate for O(log N) steps. Cost = O((N/log N) * log N) = O(N): cost efficient. This is sometimes called algorithm cascading, and it can lead to significant speedups in practice.
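Restating the slide's two cost calculations explicitly (in LaTeX):

    \text{One thread per element:}\quad O(N)\ \text{threads} \times O(\log N)\ \text{time} = O(N \log N)

    \text{Cascaded (Brent):}\quad O\!\left(\tfrac{N}{\log N}\right)\ \text{threads} \times O(\log N)\ \text{time} = O(N)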

Algorithm Cascading. Combine sequential and parallel reduction: each thread loads and sums multiple elements into shared memory, then a tree-based reduction runs in shared memory. Brent's theorem says each thread should sum O(log n) elements, i.e. 1024 or 2048 elements per block vs. 256. In my experience, it is beneficial to push it even further: possibly better latency hiding with more work per thread; more threads per block reduces the number of levels in the tree of recursive kernel invocations; and there is high kernel launch overhead in the last levels with few blocks. On G80, best performance was achieved with 64-256 blocks of 128 threads, i.e. 1024-4096 elements per thread.

Reduction #7: Multiple Adds / Thread. Replace the load and add of two elements:

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
    __syncthreads();

with a while loop that adds as many elements as necessary:

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;

    while (i < n) {
        sdata[tid] += g_idata[i] + g_idata[i+blockSize];
        i += gridSize;
    }
    __syncthreads();

Note: gridSize is used as the loop stride to maintain coalescing!

Performance for 4M element reduction:

Kernel 1 (interleaved addressing with divergent branching): 8.054 ms, 2.083 GB/s
Kernel 2 (interleaved addressing with bank conflicts): 3.456 ms, 4.854 GB/s (2.33x step, 2.33x cumulative)
Kernel 3 (sequential addressing): 1.722 ms, 9.741 GB/s (2.01x step, 4.68x cumulative)
Kernel 4 (first add during global load): 0.965 ms, 17.377 GB/s (1.78x step, 8.34x cumulative)
Kernel 5 (unroll last warp): 0.536 ms, 31.289 GB/s (1.8x step, 15.01x cumulative)
Kernel 6 (completely unrolled): 0.381 ms, 43.996 GB/s (1.41x step, 21.16x cumulative)
Kernel 7 (multiple elements per thread): 0.268 ms, 62.671 GB/s (1.42x step, 30.04x cumulative)

Kernel 7 on 32M elements: 73 GB/s!

Final Optimized Kernel

    template <unsigned int blockSize>
    __global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
        extern __shared__ int sdata[];

        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*(blockSize*2) + tid;
        unsigned int gridSize = blockSize*2*gridDim.x;
        sdata[tid] = 0;

        while (i < n) {
            sdata[tid] += g_idata[i] + g_idata[i+blockSize];
            i += gridSize;
        }
        __syncthreads();

        if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
        if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
        if (blockSize >= 128) { if (tid < 64)  { sdata[tid] += sdata[tid + 64];  } __syncthreads(); }

        if (tid < 32) {
            if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
            if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
            if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
            if (blockSize >= 8)  sdata[tid] += sdata[tid + 4];
            if (blockSize >= 4)  sdata[tid] += sdata[tid + 2];
            if (blockSize >= 2)  sdata[tid] += sdata[tid + 1];
        }

        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }
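For reference, a hypothetical host-side first-level launch of reduce6, using the G80 tuning from the earlier slides (128-thread blocks, at most 64 blocks); the grid-sizing expression and the follow-up launch are assumptions, not part of the original deck:

    // First-level launch: each block sums roughly n/blocks elements via the while loop.
    int threads = 128;
    int blocks  = (n + threads * 2 - 1) / (threads * 2);
    if (blocks > 64) blocks = 64;                 // cap suggested by the performance chart
    int smemSize = threads * sizeof(int);
    reduce6<128><<<blocks, threads, smemSize>>>(d_idata, d_odata, n);
    // A second launch (or the switch statement shown earlier) then reduces the
    // 'blocks' partial sums in d_odata down to a single value.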

Performance Comparison. [Chart: time in ms (log scale) vs. number of elements, from 131072 to 33554432, for all seven kernels: 1: interleaved addressing with divergent branches; 2: interleaved addressing with bank conflicts; 3: sequential addressing; 4: first add during global load; 5: unroll last warp; 6: completely unrolled; 7: multiple elements per thread (max 64 blocks).]

Types of Optimization. Interesting observation: the algorithmic optimizations (changes to addressing, algorithm cascading) give an 11.84x speedup, combined, while the code optimizations (loop unrolling) give a 2.54x speedup, combined.

Conclusion. Understand CUDA performance characteristics: memory coalescing, divergent branching, bank conflicts, latency hiding. Use peak performance metrics to guide optimization. Understand parallel algorithm complexity theory. Know how to identify the type of bottleneck, e.g. memory, core computation, or instruction overhead. Optimize your algorithm, then unroll loops. Use template parameters to generate optimal code. Questions: mharris@nvidia.com