Unrolling parallel loops

Vasily Volkov, UC Berkeley. November 14, 2011.

Today: a very simple optimization technique that closely resembles loop unrolling and is widely used in high-performance codes.

Mapping to GPU: it starts with a loop.

    for( int i = 0; i < n; i++ )
        a[i] = b[i] + c[i];

GPU kernel:

    __global__ void add( int *a, int *b, int *c )
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        a[i] = b[i] + c[i];
    }

One loop iteration is mapped to one GPU thread.
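Not shown on the slide: a minimal host-side launch for this kernel, assuming n is a multiple of the block size and that a, b, c already point to device memory.

    int threadsPerBlock = 256;                      // any multiple of the warp size
    int blocks = n / threadsPerBlock;               // one thread per element
    add<<< blocks, threadsPerBlock >>>( a, b, c );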

What if you unroll the loop before mapping?

Unroll the loop first:

    for( int i = 0; i < n; i++ )
        a[i] = b[i] + c[i];

becomes

    for( int i = 0; i < n; i += 2 )
    {
        a[i+0] = b[i+0] + c[i+0];
        a[i+1] = b[i+1] + c[i+1];
    }

2x fewer iterations, 2x more work per iteration.

...and then map to GPU? GPU kernel:

    __global__ void add( int *a, int *b, int *c )
    {
        int i = 2*(threadIdx.x + blockIdx.x*blockDim.x);
        a[i+0] = b[i+0] + c[i+0];
        a[i+1] = b[i+1] + c[i+1];
    }

2x fewer threads, 2x more work per thread. But why would you ever do that?
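If n is not a multiple of twice the launch width, the unrolled kernel needs bounds checks. A sketch, not from the slides; the extra n parameter is an assumption:

    __global__ void add( int *a, int *b, int *c, int n )
    {
        int i = 2*(threadIdx.x + blockIdx.x*blockDim.x);
        if( i+0 < n )  a[i+0] = b[i+0] + c[i+0];   // guard each element in case
        if( i+1 < n )  a[i+1] = b[i+1] + c[i+1];   // n is odd or not a multiple
    }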

Agenda:
I. Speedup in a molecular dynamics kernel
II. Speedup in a radio astronomy kernel
III. Case study: a linear algebra kernel

Example: molecular dynamics. One of the first works in CUDA: Stone et al. 2007. Accelerating molecular modeling applications with graphics processors. JCC 28(16), 2618-2640. It found that 8x unrolling gives a 2x speedup.

Charged particles on a 3D grid: particle i has charge q_i and position r_i. Goal: compute the electric potential at every grid point.

Pseudo-code for the problem:

    for each grid point i:
        for each particle j:
            add up j's contribution to i
        store the result

Parallelize the outer loop:

    for each grid point i in parallel:
        for each particle j:
            add up j's contribution to i
        store the result

Mapping: one grid point per thread. This is 19x faster than optimized CPU code. Can we do better?

Unroll the parallel loop:

    for every second grid point i in parallel:
        for each particle j:
            add up j's contribution to i
            add up j's contribution to i+1
        store both results
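The slides show only pseudo-code. A minimal CUDA sketch of the two-grid-points-per-thread version could look like the following; all names, the float4 particle layout, and the flat grid indexing are assumptions, and the actual kernel of Stone et al. is more elaborate. The contribution being summed is the usual Coulomb term q_j / |r_i - r_j|.

    __global__ void potential2( const float4 *particles,   // (x, y, z, charge)
                                const float3 *gridPos, float *V, int nParticles )
    {
        int i = 2*(threadIdx.x + blockIdx.x*blockDim.x);    // two grid points per thread
        float3 p0 = gridPos[i+0], p1 = gridPos[i+1];
        float v0 = 0.0f, v1 = 0.0f;
        for( int j = 0; j < nParticles; j++ )
        {
            float4 q = particles[j];                        // each particle is loaded once...
            float dx0 = q.x-p0.x, dy0 = q.y-p0.y, dz0 = q.z-p0.z;
            float dx1 = q.x-p1.x, dy1 = q.y-p1.y, dz1 = q.z-p1.z;
            v0 += q.w * rsqrtf( dx0*dx0 + dy0*dy0 + dz0*dz0 );   // ...and used for
            v1 += q.w * rsqrtf( dx1*dx1 + dy1*dy1 + dz1*dz1 );   // two grid points
        }
        V[i+0] = v0;
        V[i+1] = v1;
    }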

Multiple grid points per thread (figure: each thread now handles several grid points).

Unrolling results in a 2.2x speedup.

    Grid points per thread | Speedup vs. quad-core CPU
    1                      | 19x
    4                      | 39x
    8                      | 41x

A substantial speedup for a simple optimization!

Radio astronomy. One of the later papers: Clark et al. 2011. Accelerating Radio Astronomy Cross-Correlation with Graphics Processing Units, arXiv:1107.4264v2. It reports a 1.8x speedup from doing 4x more work per thread.

Array of antennas: many antennas work as a single telescope, at the cost of extra processing power. Input data: the signal from each antenna. Output: the cross-correlation. Different frequencies are processed separately.

Parallelized pseudo-code:

    for each pair of antennas i and j in parallel:
        for each time sample t:
            S_ij := S_ij + X_i(t) X_j*(t)
        store the result

Unrolling the loops:

    for each pair of even i and j in parallel:    (mapped to threads)
        for each time sample t:
            S_{i,j}     := S_{i,j}     + X_i(t)     X_j*(t)
            S_{i+1,j}   := S_{i+1,j}   + X_{i+1}(t) X_j*(t)
            S_{i,j+1}   := S_{i,j+1}   + X_i(t)     X_{j+1}*(t)
            S_{i+1,j+1} := S_{i+1,j+1} + X_{i+1}(t) X_{j+1}*(t)
        store the results
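The paper's kernel is not reproduced in the slides. A hedged sketch of the per-thread 2x2 accumulation, using the cuComplex helpers (all names and the per-antenna pointer layout are assumptions), shows where the data reuse comes from: each of the four loads per time sample feeds two of the four products.

    #include <cuComplex.h>

    __device__ void accumulate2x2( const cuFloatComplex *Xi0, const cuFloatComplex *Xi1,
                                   const cuFloatComplex *Xj0, const cuFloatComplex *Xj1,
                                   int nSamples, cuFloatComplex S[2][2] )
    {
        for( int t = 0; t < nSamples; t++ )
        {
            cuFloatComplex xi0 = Xi0[t], xi1 = Xi1[t];      // 4 loads per time sample...
            cuFloatComplex xj0 = Xj0[t], xj1 = Xj1[t];
            S[0][0] = cuCaddf( S[0][0], cuCmulf( xi0, cuConjf( xj0 ) ) );   // ...feed 4 products,
            S[0][1] = cuCaddf( S[0][1], cuCmulf( xi0, cuConjf( xj1 ) ) );   // so each load is
            S[1][0] = cuCaddf( S[1][0], cuCmulf( xi1, cuConjf( xj0 ) ) );   // reused twice
            S[1][1] = cuCaddf( S[1][1], cuCmulf( xi1, cuConjf( xj1 ) ) );
        }
    }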

Unrolling results in a 1.8x speedup.

    Matrix entries per thread | Gflop/s
    1x1                       | 562
    2x1                       | 852
    2x2                       | 1023

Reason: data reuse in local variables. Blocking in GPU matrix multiply was used before CUDA; see Moravánszky, A. 2003. Dense Matrix Algebra on the GPU.

Case study: small linear solves.
- Solve many independent 32x32 s.p.d. systems Ax = b
- Solve one system per thread block
- The minimum-flop solution is Cholesky + triangular solve, but it is challenging to implement efficiently in SIMD
- Use Gauss-Jordan instead, with no pivoting; drawback: it does 6x more flops than Cholesky
- Here: omit the right-hand side; it is easy to add back with little overhead (1.2x slowdown)
- Target platform: GTX480, CUDA 4.0
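"One system per thread block" corresponds to a 32x32 thread block per matrix. A hypothetical host-side launch (the names numProblems, d_in, d_out are assumptions):

    dim3 block( 32, 32 );          // thread (x, y) owns element A[y][x] of its matrix
    dim3 grid( numProblems );      // blockIdx.x selects which system to solve
    eliminate<<< grid, block >>>( d_in, d_out );

For the unrolled versions that follow, the block shrinks to 32 x (32/k) threads for k rows per thread, while the grid stays the same.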

Baseline solution:

    __shared__ float A[32][32];

    __global__ void eliminate( float *in, float *out )
    {
        int x = threadIdx.x, y = threadIdx.y, problem = blockIdx.x;

        // copy matrix to shared memory
        A[y][x] = in[32*32*problem+32*y+x];

        // run Gauss-Jordan in shared memory (see next slide)
        #pragma unroll
        for( int i = 0; i < 32; i++ )
        {
            if( y == i ) A[y][x] /= A[i][i];
            __syncthreads( );
            if( y != i ) A[y][x] -= A[y][i]*A[i][x];
        }

        // copy result to global memory
        out[32*32*problem+32*y+x] = A[y][x];
    }

Gauss-Jordan in shared memory:
1. Scale the pivot row (get 1 on the diagonal)
2. Subtract it from every other row (get 0 off the diagonal)
(figure: snapshots of the matrix after each of the two steps)

    for( int i = 0; i < 32; i++ )
    {
        if( y == i ) A[y][x] /= A[i][i];
        __syncthreads( );
        if( y != i ) A[y][x] -= A[y][i]*A[i][x];
        // no __syncthreads( ) needed here
    }

Unroll the parallel loop: use half as many threads, but do twice as much work per thread. This amounts to replicating lines of code.

Unrolling 2x (the new lines are highlighted in red on the slide):

    __global__ void eliminate( float *in, float *out )
    {
        int x = threadIdx.x, y = threadIdx.y, problem = blockIdx.x;

        // copy matrix to shared memory
        A[2*y+0][x] = in[32*32*problem+32*(2*y+0)+x];
        A[2*y+1][x] = in[32*32*problem+32*(2*y+1)+x];

        // Gauss-Jordan in shared memory
        #pragma unroll
        for( int i = 0; i < 32; i++ )
        {
            if( y == i/2 ) A[i][x] /= A[i][i];
            __syncthreads( );
            if( 2*y+0 != i ) A[2*y+0][x] -= A[i][x]*A[2*y+0][i];
            if( 2*y+1 != i ) A[2*y+1][x] -= A[i][x]*A[2*y+1][i];
        }

        // store the result in global memory
        out[32*32*problem+32*(2*y+0)+x] = A[2*y+0][x];
        out[32*32*problem+32*(2*y+1)+x] = A[2*y+1][x];
    }

Unrolling 4x:

    __global__ void eliminate( float *in, float *out )
    {
        int x = threadIdx.x, y = threadIdx.y, problem = blockIdx.x;

        A[4*y+0][x] = in[32*32*problem+32*(4*y+0)+x];
        A[4*y+1][x] = in[32*32*problem+32*(4*y+1)+x];
        A[4*y+2][x] = in[32*32*problem+32*(4*y+2)+x];
        A[4*y+3][x] = in[32*32*problem+32*(4*y+3)+x];

        // do 4x more work...
        #pragma unroll
        for( int i = 0; i < 32; i++ )
        {
            if( y == i/4 ) A[i][x] /= A[i][i];
            __syncthreads( );
            if( 4*y+0 != i ) A[4*y+0][x] -= A[i][x]*A[4*y+0][i];
            if( 4*y+1 != i ) A[4*y+1][x] -= A[i][x]*A[4*y+1][i];
            if( 4*y+2 != i ) A[4*y+2][x] -= A[i][x]*A[4*y+2][i];
            if( 4*y+3 != i ) A[4*y+3][x] -= A[i][x]*A[4*y+3][i];
        }

        out[32*32*problem+32*(4*y+0)+x] = A[4*y+0][x];
        out[32*32*problem+32*(4*y+1)+x] = A[4*y+1][x];
        out[32*32*problem+32*(4*y+2)+x] = A[4*y+2][x];
        out[32*32*problem+32*(4*y+3)+x] = A[4*y+3][x];
    }

Same but shorter:

    __global__ void eliminate( float *in, float *out )
    {
        int x = threadIdx.x, y = threadIdx.y, problem = blockIdx.x;

        for( int j = 4*y; j < 4*(y+1); j++ )    // unrolled by compiler
            A[j][x] = in[32*32*problem+32*j+x];

        #pragma unroll
        for( int i = 0; i < 32; i++ )
        {
            if( y == i/4 ) A[i][x] /= A[i][i];
            __syncthreads( );
            for( int j = 4*y; j < 4*(y+1); j++ )
                if( j != i ) A[j][x] -= A[j][i]*A[i][x];
        }

        for( int j = 4*y; j < 4*(y+1); j++ )
            out[32*32*problem+32*j+x] = A[j][x];
    }

Unrolling 8x:

    __global__ void eliminate( float *in, float *out )
    {
        int x = threadIdx.x, y = threadIdx.y, problem = blockIdx.x;

        for( int j = 8*y; j < 8*(y+1); j++ )
            A[j][x] = in[32*32*problem+32*j+x];

        #pragma unroll
        for( int i = 0; i < 32; i++ )
        {
            if( y == i/8 ) A[i][x] /= A[i][i];
            __syncthreads( );
            for( int j = 8*y; j < 8*(y+1); j++ )
                if( j != i ) A[j][x] -= A[j][i]*A[i][x];
        }

        for( int j = 8*y; j < 8*(y+1); j++ )
            out[32*32*problem+32*j+x] = A[j][x];
    }

Unrolling 16x:

    __global__ void eliminate( float *in, float *out )
    {
        int x = threadIdx.x, y = threadIdx.y, problem = blockIdx.x;

        for( int j = 16*y; j < 16*(y+1); j++ )
            A[j][x] = in[32*32*problem+32*j+x];

        #pragma unroll
        for( int i = 0; i < 32; i++ )
        {
            if( y == i/16 ) A[i][x] /= A[i][i];
            __syncthreads( );
            #pragma unroll    // have to be explicit for heavy unrolling
            for( int j = 16*y; j < 16*(y+1); j++ )
                if( j != i ) A[j][x] -= A[j][i]*A[i][x];
        }

        for( int j = 16*y; j < 16*(y+1); j++ )
            out[32*32*problem+32*j+x] = A[j][x];
    }

Aggregate speedup: 2.6x (chart: Gflop/s vs. matrix entries per thread for 1, 2, 4, 8, and 16 entries). Let's use the profiler to figure out what happened.

Profiler statistics per thread:

    Elements per thread | Gflop/s | Registers per thread | Instructions executed per thread | Occupancy
    1                   | 62      | 8                    | 397                              | 0.67
    2                   | 135     | 9                    | 470                              | 1.00
    4                   | 153     | 11                   | 783                              | 1.00
    8                   | 158     | 15                   | 1431                             | 0.67
    16                  | 159     | 21                   | 2740                             | 0.33

More resources are consumed per thread, and occupancy goes up and down. This doesn't really explain the speedup.

Profiler statistics per thread block:

    Threads per block | Gflop/s | Registers per block | Instructions per thread block | Thread blocks per multiprocessor
    1024              | 62      | 8192                | 12704                         | 1
    512               | 135     | 4608                | 7520                          | 3
    256               | 153     | 2816                | 6264                          | 6
    128               | 158     | 1920                | 5724                          | 8
    64                | 159     | 1344                | 5480                          | 8

Fewer resources are used per thread block, i.e. per the same amount of work: more concurrent thread blocks, and fewer instructions per solve.

Poor latency hiding with 1 thread block per SM: the first thing the block does is access global memory, and it can't do any computing until the data arrives, so it can't hide latency. We need at least 2 concurrent blocks, which is not possible when using 32x32 thread blocks.

Instruction throughput (chart: instructions per cycle vs. matrix entries per thread, with the 1-thread-block-per-multiprocessor case annotated). Instruction throughput decreases, yet the code runs faster?

Instructions executed per thread block (chart: counts for 1, 2, 4, 8, and 16 elements per thread, broken down into shared-memory accesses, flops, and other). There are dramatically fewer auxiliary instructions (control, barriers, etc.), a similar effect to classical loop unrolling. Most instructions are shared memory accesses?!

Why so many shared memory accesses? How many instructions is this:

    A[y][x] -= A[y][i]*A[i][x];

It is 1 arithmetic instruction (an FMA), plus 3 loads and 1 store. Note: each shared memory load costs as much as 2 arithmetic instructions (there are 32 banks vs. 32 streaming processors, but the banks run at half the clock rate). So these 3 loads are 6x more expensive than the 1 FMA. Can we eliminate some of them?

Look for reuse (the same 8x kernel):

    __global__ void eliminate( float *in, float *out )
    {
        int x = threadIdx.x, y = threadIdx.y, problem = blockIdx.x;

        for( int j = 8*y; j < 8*(y+1); j++ )
            A[j][x] = in[32*32*problem+32*j+x];

        #pragma unroll
        for( int i = 0; i < 32; i++ )
        {
            if( y == i/8 ) A[i][x] /= A[i][i];
            __syncthreads( );
            for( int j = 8*y; j < 8*(y+1); j++ )
                if( j != i ) A[j][x] -= A[j][i]*A[i][x];
        }

        for( int j = 8*y; j < 8*(y+1); j++ )
            out[32*32*problem+32*j+x] = A[j][x];
    }

Reuse local copies instead:

    __global__ void eliminate( float *in, float *out )
    {
        int x = threadIdx.x, y = threadIdx.y, problem = blockIdx.x;
        float a[8];    // array in registers

        for( int j = 0; j < 8; j++ )
            a[j] = A[8*y+j][x] = in[32*32*problem+32*(8*y+j)+x];

        #pragma unroll
        for( int i = 0; i < 32; i++ )
        {
            if( y == i/8 ) A[i][x] = a[i%8] /= A[i][i];
            __syncthreads( );
            float Aix = A[i][x];
            for( int j = 0; j < 8; j++ )
                if( 8*y+j != i ) A[8*y+j][x] = a[j] -= A[8*y+j][i]*Aix;
        }

        for( int j = 0; j < 8; j++ )
            out[32*32*problem+32*(8*y+j)+x] = a[j];
    }

The effect: a further 1.8x speedup (chart: Gflop/s vs. matrix entries per thread). Baseline: 62 Gflop/s; no reuse: 159 Gflop/s; with reuse: 280 Gflop/s.

Conclusion: a simple optimization technique that resembles loop unrolling and often results in a 2x speedup.