Lecture 2: CUDA Programming


CS 515 Programming Languages and Compilers I, Lecture 2: CUDA Programming. Zheng (Eddy) Zhang, Rutgers University, Fall 2017 (9/12/2017).

Review: Programming in CUDA. Let's look at a sequential program in C first: add vector A and vector B into vector C.

void add_vector (float *A, float *B, float *C, int N) {
    for (int index = 0; index < N; index++)
        C[index] = A[index] + B[index];
}

void main ( ) {
    add_vector (A, B, C, N);
}

Review: Programming in CUDA. Now let's look at the parallel implementation in CUDA: add vector A and vector B into vector C.

__global__ void add_vector (float *A, float *B, float *C, int N) {
    if (tid < N)
        C[tid] = A[tid] + B[tid];
}

void main ( ) {
    add_vector <<<dimGrid, dimBlock>>> (A, B, C, N);
}

Sample code: /ilab/users/zz124/cs515_2017/samples/0_simple/vectoradd

What is different?

CPU program (add vector A and vector B into vector C):

void add_vector (float *A, float *B, float *C, int N) {
    for (int index = 0; index < N; index++)
        C[index] = A[index] + B[index];
}

void main ( ) {
    add_vector (A, B, C, N);
}

GPU program (add vector A and vector B into vector C):

__global__ void add_vector (float *A, float *B, float *C, int N) {
    if (tid < N)
        C[tid] = A[tid] + B[tid];
}

void main ( ) {
    add_vector <<<dimGrid, dimBlock>>> (A, B, C, N);
}

Review: Programming in CUDA. The parallel implementation in CUDA, split into device and host code.

Device code (GPU):

__global__ void add_vector (float *A, float *B, float *C, int N) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < N)
        C[tid] = A[tid] + B[tid];
}

Host code (CPU):

void main ( ) {
    // allocate memory on the GPU and initialize it
    add_vector <<<dimGrid, dimBlock>>> (A, B, C, N);
    // free memory on the GPU
}

Sample code: /ilab/users/zz124/cs515_2017/samples/0_simple/vectoradd
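
The slide elides the host-side allocation, data transfers, and launch configuration. A minimal sketch of what they might look like is below; the names (h_A, d_A, threadsPerBlock, ...) and the block size of 256 are illustrative assumptions, not taken from the lecture.

void host_main(int N) {
    size_t bytes = N * sizeof(float);
    float *h_A = (float *) malloc(bytes);          // host buffers
    float *h_B = (float *) malloc(bytes);
    float *h_C = (float *) malloc(bytes);
    // ... initialize h_A and h_B ...

    float *d_A, *d_B, *d_C;                        // device (GPU) buffers
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;                     // assumed launch configuration
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    add_vector<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);   // free GPU memory
    free(h_A); free(h_B); free(h_C);
}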

Review: Programming in CUDA. CPU computation and GPU computation. (Figure: a timeline alternating CPU work and GPU computation.)

Review: Thread View. A kernel is executed as a grid of thread blocks. A thread block is a batch of threads that can cooperate with each other by:
- synchronizing their execution
- efficiently sharing data through a low-latency shared memory (scratch-pad memory)
Two threads from two different blocks cannot cooperate through shared memory. A thread block is organized as a set of warps:
- a warp is a SIMD thread group
- a warp has 32 threads
Kernels can also run simultaneously if they are launched in asynchronous streams (see the sketch below).
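
As a rough illustration of the last point (not from the slides; the device pointers and launch configuration are made up), two independent launches of the add_vector kernel in different streams may overlap on the device:

void launch_in_streams(float *d_A, float *d_B, float *d_C,
                       float *d_D, float *d_E, float *d_F, int N)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    int block = 256, grid = (N + block - 1) / block;        // assumed configuration
    add_vector<<<grid, block, 0, s1>>>(d_A, d_B, d_C, N);   // independent launches in different
    add_vector<<<grid, block, 0, s2>>>(d_D, d_E, d_F, N);   // streams may execute concurrently

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}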

Review: Single Instruction Multiple Data (SIMD). C[index] = A[index] + B[index]; thread 0: C[0] = A[0] + B[0]; thread 1: C[1] = A[1] + B[1]; ... thread 15: C[15] = A[15] + B[15]. Flynn's taxonomy for modern processors:
- Single Instruction Single Data (SISD)
- Single Instruction Multiple Data (SIMD)
- Multiple Instruction Single Data (MISD)
- Multiple Instruction Multiple Data (MIMD)
GPU: SIMT (Single Instruction Multiple Thread), the simultaneous execution of many SIMD thread groups.

Review: GPU Architecture. NVIDIA GeForce 8800 GTX:
- 128 streaming processors (SPs), each at 575 MHz
- each streaming multiprocessor (SM) has 16 KB of shared memory

Review: Thread Blocks on an SM. Threads are assigned to SMs at BLOCK granularity:
- the number of blocks per SM is limited by resource constraints
- one SM in a C2075 can host up to 1536 threads: 256 threads/block with 6 blocks, 128 threads/block with 12 blocks, etc.
Threads run concurrently. The SM assigns and maintains thread IDs, and the SM manages and schedules thread execution.

Review: Thread Life Cycle in Hardware. Thread blocks are distributed to all the SMs, typically more than one block per SM. A streaming multiprocessor schedules and executes the warps that are ready. As a thread block completes, its resources are freed and more thread blocks are distributed to the SMs.

Review: Thread Scheduling and Execution. Warps are the scheduling units within an SM. If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there? 3 * 256 / 32 = 24. At any point in time, only a limited number of those warps will be selected for execution on the SM.

Review: SM Warp Scheduling. SM hardware implements zero-overhead warp scheduling:
- warps whose next instruction has its operands ready for consumption are eligible for execution
- eligible warps are selected for execution with a prioritized scheduling policy (round-robin/age)
- all threads in a warp execute the same instruction when it is selected
Typically 4 clock cycles are needed to dispatch the same instruction for all threads in a warp. (Figure: over time, the SM's multithreaded warp scheduler issues warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, then warp 8 instruction 12 and warp 3 instruction 96.)

Review: SM Warp Scheduling. SM hardware implements zero-overhead warp scheduling. Question: assume one global memory access is needed for every 4 instructions, only one instruction can be issued at a time, and it takes 4 cycles to dispatch an instruction for one warp. What is the minimum number of warps needed to tolerate the 320-cycle global memory latency?

Review: SM Warp Scheduling. (Figure: warps 1, 2, 3, ... execute in turn while one warp waits out the 320-cycle global memory latency.) Question: assume one global memory access is needed for every 4 instructions, only one instruction can be issued at a time, and it takes 4 cycles to dispatch an instruction for one warp. What is the minimum number of warps needed to tolerate the 320-cycle global memory latency?
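
One way to work the question out (this reasoning is not spelled out on the slide): between two memory accesses a warp issues 4 instructions at 4 cycles each, i.e. 4 x 4 = 16 cycles of useful issue work. While one warp waits 320 cycles for memory, the scheduler therefore needs roughly 320 / 16 = 20 other ready warps to stay busy, so on the order of 20 additional warps (about 21 warps in flight) are needed to hide the latency.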

CUDA Memory Model. On-chip memory:
- registers
- shared memory (SMEM)
- instruction cache, constant cache, texture/L1 cache
- small latencies, i.e., ~20 cycles or less
- L2 (last-level cache): ~400-1000 cycles latency
Off-chip memory:
- local memory: register spills, activation records, thread-private variables
- constant memory: automatically mapped to the constant cache by hardware
- texture memory: automatically mapped to the texture cache by hardware
- global memory: main device memory, read/write, visible to all threads in the same GPU program
(Figure: the on-chip L1/texture/constant caches sit above the off-chip global/constant/texture/local memory.)
Optional reading: Wong et al., "Demystifying GPU Microarchitecture through Microbenchmarking", ISPASS '10.
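
The following small kernel is an illustrative sketch (not from the lecture; all names are made up) of where variables live in this memory model:

__constant__ float coeff[16];                    // constant memory, served by the constant cache

__global__ void memory_spaces(float *out)        // out points into global memory
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;   // tid lives in a register
    __shared__ float tile[256];                  // on-chip shared memory (SMEM), one copy per thread block
    float scratch[64];                           // large per-thread arrays may be spilled to local memory

    scratch[threadIdx.x % 64] = coeff[threadIdx.x % 16];
    tile[threadIdx.x % 256] = scratch[threadIdx.x % 64];
    __syncthreads();
    out[tid] = tile[threadIdx.x % 256];          // write back to global memory
}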

CUDA Software Cache. Software cache (SMEM):
- each thread block gets an equal-size shared memory partition
- requires explicit shared memory allocation
- requires explicit data movement management
Software cache vs. hardware cache (the L1/texture/constant caches in front of global/constant/texture/local memory):
- replacement: implicit by hardware vs. explicit by software
- cache eviction policy: least recently used (LRU) vs. programmer-specified
- addressing model: complex hash function vs. programmer-specified
- partition model: no fixed-size partition vs. fixed-size partitions for different thread blocks
- interaction with concurrency: no impact on concurrency vs. affects the concurrency level
Optional reading: Silberstein et al., "Efficient Computation of Sum-products on GPUs through Software-managed Cache", ICS '08.
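
As a minimal sketch of this software-managed caching pattern (not from the lecture; the kernel, the TILE size, and the assumption that n is a multiple of TILE are all illustrative), each block explicitly stages a tile of the input in shared memory, synchronizes, and then reuses it:

#define TILE 256   // launch with blockDim.x == TILE

__global__ void smooth(float *out, const float *in, int n)   // assumes n is a multiple of TILE
{
    __shared__ float tile[TILE + 2];                 // explicit allocation: tile plus one halo element per side
    int gid = blockIdx.x * TILE + threadIdx.x;

    tile[threadIdx.x + 1] = in[gid];                 // explicit data movement into the software cache
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : in[gid];           // left halo
    if (threadIdx.x == TILE - 1)
        tile[TILE + 1] = (gid + 1 < n) ? in[gid + 1] : in[gid]; // right halo
    __syncthreads();                                 // make the staged tile visible to the whole block

    out[gid] = (tile[threadIdx.x] + tile[threadIdx.x + 1] + tile[threadIdx.x + 2]) / 3.0f;
}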

CUDA Synchronization Primitives. Within a thread block:
- barrier: __syncthreads()
- all computation by threads in the thread block before the barrier completes before any computation by threads after the barrier begins
- barriers are a conservative way to express dependencies
- barriers divide computation into phases
- in conditional code, the condition must be uniform across the block
Across thread blocks:
- implicit synchronization at the end of a GPU kernel
- a barrier in the middle of a GPU kernel can be implemented using atomic compare-and-swap (CAS) and memory fence operations (a sketch follows below)
Optional reading: Xiao et al., "Inter-block GPU Communication via Fast Barrier Synchronization", IPDPS '10.
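
As a rough illustration of the last bullet, in the spirit of Xiao et al. (this sketch is not from the lecture, uses atomicAdd rather than CAS, is single-use, and deadlocks unless all thread blocks of the grid are resident on the GPU at the same time):

__device__ volatile int g_arrived = 0;        // global counter, assumed initialized to 0 before launch

__device__ void global_barrier(int num_blocks)   // call with num_blocks == gridDim.x
{
    __syncthreads();                              // every thread of this block has reached the barrier
    if (threadIdx.x == 0) {
        __threadfence();                          // make this block's prior global writes visible
        atomicAdd((int *) &g_arrived, 1);         // announce arrival
        while (g_arrived < num_blocks)            // spin until every block has arrived
            ;
    }
    __syncthreads();                              // release the remaining threads of this block
    // note: the counter is never reset, so this barrier can be used only once per launch
}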

Case Study 1: Matrix Transpose

Case Study 1: Matrix Transpose. Simple implementation (one thread is assigned to one cell of the input matrix):

__global__ void transpose_naive(float *outputdata, float *inputdata, int width, int height)
{
    unsigned int xindex = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int yindex = blockDim.y * blockIdx.y + threadIdx.y;
    if (xindex < width && yindex < height) {
        unsigned int index_in  = xindex + width * yindex;
        unsigned int index_out = yindex + height * xindex;
        outputdata[index_out] = inputdata[index_in];
    }
}

Performance Tuning: Global Memory Coalescing. Data is fetched as a contiguous memory chunk (a cache block/line):
- a cache block/line is typically 128 bytes
- the purpose is to utilize the wide DRAM burst for maximum memory-level parallelism (MLP)
A thread warp is ready when all of its threads' operands are ready, so we want to coalesce the data operands of a thread warp into as few contiguous memory chunks as possible. Assume each thread performs the memory access = A[P[tid]]:
- P[ ] = { 0, 1, 2, 3, 4, 5, 6, 7 }: coalesced accesses
- P[ ] = { 0, 5, 1, 7, 4, 3, 6, 2 }: non-coalesced accesses
(Figure: for the first P[], the accesses of threads 0-7 to A[] are contiguous and fall into a single cache block; for the second P[], they are scattered across cache blocks.)
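
A tiny kernel (illustrative, not from the slides; the names are made up) contrasting a directly coalesced access with an indirect access through P[]:

__global__ void gather(float *out, const float *A, const int *P, int n)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < n) {
        float direct   = A[tid];       // coalesced: consecutive threads touch consecutive addresses
        float indirect = A[P[tid]];    // coalesced only if P maps the warp to one contiguous chunk
        out[tid] = direct + indirect;  // the store to out[] is coalesced
    }
}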

Case Study 1: Matrix Transpose. Any non-coalesced memory access in the naive kernel above? Consider its two global memory accesses: is the read inputdata[index_in] coalesced? Is the write outputdata[index_out] coalesced?

Case Study 1: Matrix Transpose. In the naive kernel, the read inputdata[index_in] is coalesced (index_in = xindex + width * yindex, so consecutive threads in x read consecutive addresses), but the write outputdata[index_out] is non-coalesced (index_out = yindex + height * xindex, so consecutive threads write with a stride of height). Performance on 1 million elements:
- naive implementation: throughput = 8.0226 GB/s, time = 0.97381 ms
- improved implementation: throughput = 11.5080 GB/s, time = 0.67887 ms

Case Study 1: Matrix Transpose. An improved implementation: use shared memory to improve global memory coalescing. One thread is assigned to one cell of the input matrix.

__global__ void transpose(float *outputdata, float *inputdata, int width, int height)
{
    __shared__ float block[BLOCK_DIM][BLOCK_DIM + 1];

    unsigned int xindex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yindex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    if ((xindex < width) && (yindex < height)) {
        unsigned int index_in = yindex * width + xindex;        // read from global memory
        block[threadIdx.y][threadIdx.x] = inputdata[index_in];  // store into shared memory
    }

    __syncthreads();

    xindex = blockIdx.y * BLOCK_DIM + threadIdx.x;
    yindex = blockIdx.x * BLOCK_DIM + threadIdx.y;
    if ((xindex < height) && (yindex < width)) {
        unsigned int index_out = yindex * height + xindex;      // write to global memory
        outputdata[index_out] = block[threadIdx.x][threadIdx.y];
    }
}

Case Study 1: Matrix Transpose. In the improved implementation (same kernel as above), both the read from inputdata and the write to outputdata are coalesced: each warp reads a contiguous row of the input tile and writes a contiguous row of the output tile, while the transposition itself happens in shared memory.

Case Study 1: Matrix Transpose. Performance on 1 million elements:
- naive implementation: throughput = 8.0226 GB/s, time = 0.97381 ms
- improved implementation (shared memory, both read and write coalesced): throughput = 11.5080 GB/s, time = 0.67887 ms

Case Study 1: Matrix Transpose. Code: /ilab/users/zz124/cs515_2017/samples/6_advanced/transpose
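
For completeness, the improved transpose kernel might be launched roughly as follows; BLOCK_DIM = 16 and the helper function are illustrative assumptions, not taken from the lecture:

#define BLOCK_DIM 16   // illustrative tile size; the slide does not fix a value

void launch_transpose(float *d_out, float *d_in, int width, int height)
{
    dim3 block(BLOCK_DIM, BLOCK_DIM);                   // one thread per element of a tile
    dim3 grid((width  + BLOCK_DIM - 1) / BLOCK_DIM,     // enough tiles to cover the matrix
              (height + BLOCK_DIM - 1) / BLOCK_DIM);
    transpose<<<grid, block>>>(d_out, d_in, width, height);
}

The +1 column padding in block[BLOCK_DIM][BLOCK_DIM + 1] also avoids shared memory bank conflicts when the tile is read column-wise (see Performance Hazard III below).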

GPU Performance Hazards

Performance Hazard I: Control Divergence. Thread divergence is caused by control flow. How about this?

tid:   0 1 2 3 4 5 6 7
A[ ]:  0 0 6 0 0 2 4 1

if (A[tid]) { do some work; } else { do nothing; }

Or this?

for (i = 0; i < A[tid]; i++) { ... }

Divergence may degrade performance by up to warp_size times (warp size = 32 on modern GPUs). Optional reading: Zhang et al., "On-the-Fly Elimination of Dynamic Irregularities for GPU Computing", ASPLOS '11.
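
A small illustrative kernel (not from the slides) contrasting a warp-divergent branch with a warp-uniform branch:

__global__ void divergence_demo(float *data)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;

    // Divergent: even and odd lanes of the same warp take different paths,
    // so the warp executes both paths serially.
    if (tid % 2 == 0)
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;

    // Not divergent (assuming a warp size of 32): the condition is uniform
    // within each warp, so every warp takes exactly one path.
    if ((tid / 32) % 2 == 0)
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;
}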

Performance Hazard II: Global Memory Coalescing. Data is fetched as a contiguous memory chunk (a cache block/line, typically 128 bytes) to exploit the wide DRAM burst, so a warp's accesses should be coalesced into as few contiguous chunks as possible; see the "Performance Tuning: Global Memory Coalescing" discussion and the P[ ] = { 0, 1, 2, 3, 4, 5, 6, 7 } vs. P[ ] = { 0, 5, 1, 7, 4, 3, 6, 2 } example above.

Performance Hazard III: Shared Memory Bank Conflicts. Shared memory:
- typically 16 or 32 banks
- successive 32-bit words are assigned to successive banks
- a fixed-stride access may cause bank conflicts
- maximizing bank-level parallelism is important
(Figure: threads of a warp mapping onto the shared memory banks.)
Reducing bank conflicts:
- padding, which may waste some shared memory space
- data layout transformation, e.g., array of structures to structure of arrays
- thread-data remapping, i.e., tasks that map to different banks are executed at the same time
An example of the padding trick is sketched below.
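
A sketch of the padding trick (illustrative, not from the slides), assuming 32 banks of 32-bit words and a 32 x 32 thread block: a 32 x 32 shared array maps every element of a column to the same bank, so a column access by a warp causes a 32-way conflict, while padding each row by one word spreads the column across all banks.

__global__ void bank_conflict_demo(float *out)   // launch with a 32 x 32 thread block
{
    __shared__ float conflicted[32][32];         // column access: 32-way bank conflict
    __shared__ float padded[32][33];             // +1 padding: column access is conflict-free

    int r = threadIdx.y, c = threadIdx.x;
    conflicted[r][c] = r * 32 + c;
    padded[r][c]     = r * 32 + c;
    __syncthreads();

    // Each warp (fixed r, c = 0..31) reads a column: the unpadded array hits one
    // bank 32 times, the padded array hits 32 different banks.
    out[r * 32 + c] = conflicted[c][r] + padded[c][r];
}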

Case Study 2: Parallel Reduction. What is a reduction operation? A binary operation plus a roll-up: let A = [a0, a1, a2, a3, ..., a(n-1)] and let ⊕ be an associative binary operator with identity element I; then reduce(⊕, A) = a0 ⊕ a1 ⊕ a2 ⊕ ... ⊕ a(n-1). Sequentially:

int sum = 0;
for (int i = 0; i < N; i++)
    sum = sum + A[i];

(Figure: reducing [3, 1, 7, 0, 4, 1, 6, 3] with +, sequentially versus in parallel as a tree: (4, 7, 5, 9), then (11, 14), then 25.)

Case Study 2: Parallel Reduction. First implementation:

tid = threadIdx.x;
for (s = 1; s < blockDim.x; s *= 2) {
    if (tid % (2*s) == 0) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}

Any GPU performance hazard here? Thread divergence.

Case Study 2: Parallel Reduction. First implementation, reducing 16 million elements:
- Implementation 1: throughput = 1.5325 GB/s, time = 0.04379 s
- Implementation 2: throughput = 1.9286 GB/s, time = 0.03480 s
- Implementation 3: throughput = 2.6062 GB/s, time = 0.02575 s
Any GPU performance hazard in the first implementation? Thread divergence.

Case Study 2: Parallel Reduction. Reducing thread divergence: second version vs. first version. (Figure: the thread-to-task mapping is changed so that the active threads stay contiguous.)

Case Study 2: Parallel Reduction. Reducing thread divergence: the thread-to-task mapping is changed.

First version:

tid = threadIdx.x;
for (s = 1; s < blockDim.x; s *= 2) {
    if (tid % (2*s) == 0) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}

Second version:

tid = threadIdx.x;
for (s = 1; s < blockDim.x; s *= 2) {
    int index = 2 * s * tid;
    if (index < blockDim.x) {
        sdata[index] += sdata[index + s];
    }
    __syncthreads();
}

Case Study 2: Parallel Reduction. Implementation 2, reducing 16 million elements:
- Implementation 1: throughput = 1.5325 GB/s, time = 0.04379 s
- Implementation 2: throughput = 1.9286 GB/s, time = 0.03480 s
- Implementation 3: throughput = 2.6062 GB/s, time = 0.02575 s
Any performance hazard in implementation 2? Potential shared memory bank conflicts.

Case Study 2: Parallel Reduction. Reducing shared memory bank conflicts: third version vs. second version. Let logically contiguous threads access contiguous shared memory locations.

Case Study 2: Parallel Reduction. Third implementation (utilizes implicit intra-warp synchronization):

tid = threadIdx.x;
for (s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}

Reducing 16 million elements:
- Implementation 1: throughput = 1.5325 GB/s, time = 0.04379 s
- Implementation 2: throughput = 1.9286 GB/s, time = 0.03480 s
- Implementation 3: throughput = 2.6062 GB/s, time = 0.02575 s

Code: /ilab/users/zz124/cs515_2017/samples/6_advanced/reduction
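
For context, a self-contained kernel built around the third version's loop might look like the sketch below; the shared memory setup, the boundary handling, and the per-block write-back are illustrative assumptions rather than the code from the slide:

__global__ void reduce3(float *g_out, const float *g_in, int n)
{
    extern __shared__ float sdata[];                 // sized at launch: blockDim.x * sizeof(float)
    unsigned int tid = threadIdx.x;
    unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (gid < n) ? g_in[gid] : 0.0f;       // each thread loads one element (0 pads the tail)
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {   // the loop from the slide
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        g_out[blockIdx.x] = sdata[0];                // one partial sum per block
}

A possible launch, assuming threadsPerBlock is a power of two: reduce3<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(float)>>>(d_out, d_in, n); the per-block partial sums in d_out can then be reduced again or summed on the host.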

Reading Assignments
- "Optimizing Parallel Reduction in CUDA" by Mark Harris, NVIDIA: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
- "Parallel Prefix Sum (Scan) with CUDA", GPU Gems 3, Chapter 39, by Mark Harris, Shubhabrata Sengupta, and John Owens: http://http.developer.nvidia.com/gpugems3/gpugems3_ch39.html
- "Performance Optimization" by Paulius Micikevicius, NVIDIA: http://www.nvidia.com/docs/io/116711/sc11-perf-optimization.pdf
- "CUDA C/C++ Basics": http://www.nvidia.com/docs/io/116711/sc11-cuda-c-basics.pdf
- "Loop Unrolling for GPU Computing": http://www.nvidia.com/docs/io/116711/sc11-unrolling-parallel-loops.pdf
- CUDA Programming Guide, Chapters 1-5
- CUDA C Best Practices Guide: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/

Optional Reading Assignments
- Wong et al., "Demystifying GPU Microarchitecture through Microbenchmarking", ISPASS '10
- Silberstein et al., "Efficient Computation of Sum-products on GPUs through Software-managed Cache", ICS '08
- Xiao et al., "Inter-block GPU Communication via Fast Barrier Synchronization", IPDPS '10
- Zhang et al., "On-the-Fly Elimination of Dynamic Irregularities for GPU Computing", ASPLOS '11