Code Optimizations for High Performance GPU Computing


Code Optimizations for High Performance GPU Computing Yi Yang and Huiyang Zhou Department of Electrical and Computer Engineering North Carolina State University 1

Question to answer: Given a task to accelerate some algorithm, e.g., solving a PDE, image filtering, etc., using GPU computing, how can we start? How can we systematically develop high performance GPGPU programs? 2

Outline: Hardware abstraction; a systematic approach to developing high performance GPGPU programs; optimization techniques with case studies (coalesced memory accesses, data reuse through thread (block) merge, eliminating partition conflicts, leveraging constant cache); conclusions. 3

Hardware Abstraction of GPU Architecture: Based on this simple abstraction, develop a naïve implementation without considering optimizations. Focus on data-level parallelism and functional correctness. 4

GPGPU Architecture: Fast (local) communication among processors within an SM goes through shared memory. Memory requests need to be evenly distributed among the memory controllers (MCs) to avoid conflicts/partition camping. 5

Key to Performance: Global memory access bandwidth (coalesced global memory accesses, memory partitions); fast data accesses (shared memory, constant cache, texture cache, registers); balanced resource usage, balanced ILP and TLP (thread level: register usage; thread-block level: shared memory usage). 6

Developing High Performance GPGPU Code: Naïve code → vectorization for memory access bandwidth → checking memory coalescing → converting non-coalesced accesses into coalesced ones → checking data dependencies and sharing patterns → thread & thread-block merge → data prefetching → removing memory partition camping → high performance code. 7

Naïve Kernel: Fine-grain data-level parallelism; compute one element/pixel in the output domain. Example: matrix multiplication. float sum = 0; for (int i=0; i<w; i++) sum += A[idy][i]*B[i][idx]; C[idy][idx] = sum; Naïve matrix multiplication (a complete kernel sketch follows). 8
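To make the slide's snippet concrete, here is a minimal self-contained sketch of the naïve kernel, assuming square w x w matrices stored as flat row-major device arrays and a 16 x 16 thread block; the flat indexing and the launch configuration are illustrative additions, not the authors' exact code.

// Naive matrix multiplication: one thread computes one element C[idy][idx].
// Assumes w is a multiple of the block dimensions; A, B, C are w*w floats
// in row-major order on the device.
__global__ void matmul_naive(const float *A, const float *B, float *C, int w)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    int idy = blockIdx.y * blockDim.y + threadIdx.y;  // row of C

    float sum = 0.0f;
    for (int i = 0; i < w; i++)
        sum += A[idy * w + i] * B[i * w + idx];
    C[idy * w + idx] = sum;
}

// Launch example: dim3 block(16, 16); dim3 grid(w / 16, w / 16);
// matmul_naive<<<grid, block>>>(dA, dB, dC, w);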

Physical Meaning of the Naïve Kernel: One thread computes one element at (idx, idy) in the product matrix C = A x B. float sum = 0; for (int i=0; i<w; i++) sum += A[idy][i]*B[i][idx]; C[idy][idx] = sum; Naïve matrix multiplication. 9

Outline: Hardware abstraction; a systematic approach to developing high performance GPGPU programs; optimization techniques with case studies (coalesced memory accesses, data reuse through thread (block) merge, eliminating partition conflicts, leveraging constant cache); conclusions. 10

Case Study: Convolution. C is the convolution of input matrix A with an 8x8 filter matrix B. float sum = 0; for (j=0; j<8; j=j+1) { for (i=0; i<8; i=i+1) { float a; float b; a = A[idy-j][idx-i]; b = B[j][i]; sum += a*b; } } C[idy][idx] = sum; Naïve version of convolution: one thread computes one output pixel at (idx, idy). A complete kernel sketch is given below. 11
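A minimal sketch of the naïve convolution kernel, under the assumption that A, B, and C are flat row-major device arrays, B is the 8x8 filter, and the caller pads or offsets A so that A[idy-j][idx-i] stays in bounds; the parameter name widthA and the launch details are illustrative.

// Naive 2D convolution: one thread computes one output pixel C[idy][idx].
// Assumes A carries a padded border so (idy-j, idx-i) never leaves the array.
__global__ void conv_naive(const float *A, const float *B, float *C, int widthA)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;

    float sum = 0.0f;
    for (int j = 0; j < 8; j++)
        for (int i = 0; i < 8; i++) {
            float a = A[(idy - j) * widthA + (idx - i)];
            float b = B[j * 8 + i];
            sum += a * b;
        }
    C[idy * widthA + idx] = sum;
}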

Coalesced Global Memory Access: Needed by the GPU to achieve high memory bandwidth; examined at the half-warp granularity (threads in a warp have consecutive thread ids). Requirements for coalesced global memory accesses: Aligned: the half warp must access data whose starting address is a multiple of 64 bytes. Sequential (less strict on GTX 280/480): the threads of the half warp must access the data sequentially. [Figure: threads 0-15 of a half warp accessing consecutive global memory addresses starting at a 64-byte-aligned boundary, e.g., 128 to 192.] A small illustration follows. 12
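A small illustration of the requirement, assuming a 1-D launch; the kernel names are made up for this example. In the first kernel, thread t of a half warp reads element base+t, so the 16 threads touch one aligned, contiguous 64-byte segment; in the second, the stride of 2 spreads the same half warp over twice as many segments, so the accesses are not coalesced on this class of hardware.

// Coalesced: consecutive threads read consecutive floats.
__global__ void read_coalesced(const float *in, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid];          // thread t -> in[t]: one contiguous segment per half warp
}

// Not coalesced: a stride of 2 doubles the number of memory segments touched.
__global__ void read_strided(const float *in, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[2 * tid];      // thread t -> in[2t]: gaps break coalescing
}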

Checking coalesced memory accesses. Inner loop of convolution: for (i=0; i<8; i=i+1) { float a; float b; a = A[idy-j][idx-i]; b = B[j][i]; sum += a*b; } Access pattern of B[j][i]: when i = 0, all the threads in a warp read B[j][0]; when i = 1, all the threads in a warp read B[j][1]. Therefore, the access is not coalesced. As B is a small filter, we can store it in shared memory or constant memory (cache). 13

Checking coalesced memory accesses. Inner loop of convolution: for (i=0; i<8; i=i+1) { float a; float b; a = A[idy-j][idx-i]; b = B[j][i]; sum += a*b; } Access pattern of A[idy-j][idx-i] for the warp: when i = 0, A[idy-j][idx]; when i = 1, A[idy-j][idx-1]; when i = 2, A[idy-j][idx-2]; ...; when i = 7, A[idy-j][idx-7]. Therefore, the access is not coalesced. Over the whole loop the warp accesses the data A[idy-j][idx-7 : idx+31]. 14-17

Convert to coalesced accesses with shared memory. Inner loop of convolution: for (i=0; i<8; i=i+1) { float a; float b; a = A[idy-j][idx-i]; b = B[j][i]; sum += a*b; } We preload the data into shared memory and then access it from shared memory. One warp (32 threads) loads 64 floats into shared memory. 18

Coalesced memory access: __shared__ float shared_0[64]; shared_0[tidx]=A[idy-j][idx-32]; shared_0[tidx+32]=A[idy-j][idx]; // load data into shared memory __syncthreads(); for (i=0; i<8; i=i+1) { float a=shared_0[tidx+32-i]; // access data from shared memory float b=B[j][i]; sum+=(a*b); } __syncthreads(); 32 threads (one warp) in one thread block. Each warp accesses 64 elements, A[idy-j][idx-tidx-32 : idx-tidx+31]; (idx - tidx) is the start position of the thread block. 19

Convolution: thread block merge. Inner loop of convolution: for (i=0; i<8; i=i+1) { float a; float b; a = A[idy-j][idx-i]; b = B[j][i]; sum += a*b; } Currently one warp needs to load 64 floats for the inner loop, and there is some overlap between neighboring warps/thread blocks in A[idy-j][idx-tidx-32 : idx-tidx+31]. If we put more warps into one thread block, they can share the overlapping part and reduce global memory accesses: 256 threads only need 256+32 floats from global memory. 20

Thread block merge: Improve memory reuse by merging neighboring thread blocks. Parallelism impact: increases the thread-block workload; keeps the per-thread workload. Advantage: does not increase register pressure. Disadvantage: shared data must be in shared memory (slower than registers). [Figure: two thread blocks before the merge, sharing a data segment, become one thread block after the thread-block merge.] 21

Code after thread block merge: __shared__ float shared_0[256+32]; if (tidx<32) shared_0[tidx]=A[idy-j][idx-32]; // only the first warp executes this load shared_0[tidx+32]=A[idy-j][idx]; __syncthreads(); for (i=0; i<8; i=i+1) { float a=shared_0[tidx+32-i]; float b=B[j][i]; sum+=(a*b); } __syncthreads(); 256 threads in one thread block. 22

Case study: Convolution. Outer loop of convolution: float sum = 0; for (j=0; j<8; j=j+1) { for (i=0; i<8; i=i+1) { float a; float b; a = A[idy-j][idx-i]; b = B[j][i]; sum += a*b; } } C[idy][idx] = sum; Neighboring threads in the Y direction have overlapping accesses to A. If we let one thread compute two output pixels in the Y direction, we can reduce the data accesses to A. [Figure: overlap in A between the rows read for two vertically adjacent output pixels.] 23

Convolution: thread merge. When we load 8 pixels of A from shared memory, we can run the inner loop for one output pixel, or two pixels, or three or more. So after we load the data of A from shared memory, we can keep it in a register to do more ALU computation. 24

Code after thread merge: float sum_0 = 0, sum_1 = 0; for (j=0; j<8; j=j+1) { __shared__ float shared_0[256+32]; if (tidx<32) shared_0[tidx]=A[idy-j][idx-32]; shared_0[tidx+32]=A[idy-j][idx]; __syncthreads(); for (i=0; i<8; i=i+1) { float a=shared_0[tidx+32-i]; float b_0=B[j][i]; float b_1=B[j+1][i]; // we also compute another output pixel; code for the boundary check is omitted sum_0+=(a*b_0); sum_1+=(a*b_1); } __syncthreads(); } C[2*idy][idx] = sum_0; C[2*idy+1][idx] = sum_1; One thread computes two output pixels. 25

Thread merge: Improve memory reuse by merging threads from neighboring thread blocks. Parallelism impact: increases the per-thread workload; keeps the thread-block workload. Advantage: shared data can be kept in registers or shared memory. Disadvantage: increases the register pressure of a single thread. [Figure: threads from two thread blocks before the merge, sharing a data segment, become one thread with shared registers after the thread merge.] 26

[Chart: convolution performance (Gflops) with an 8 x 8 filter matrix on GTX 480, for input matrix sizes 1k by 1k, 2k by 2k, 4k by 4k, and 8k by 8k.] The kernel reaches 70% of the theoretical computation power (1.35 Tflops) of the GTX 480, with 128 threads in one thread block and one thread computing 184 output pixels. 27

Case study: matrix-vector multiplication. One thread loads one row of A and computes one element of C. float sum = 0; for (i=0; i<w; i=i+1) { float a; float b; a = A[idx][i]; b = B[i]; sum += a*b; } C[idx] = sum; Naïve version of MV. A complete kernel sketch is given below. 28
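A minimal sketch of the naïve matrix-vector kernel above, assuming a row-major h x w matrix A stored as a flat device array; the parameter names h and w are illustrative additions.

// Naive matrix-vector multiplication: one thread computes one element C[idx].
__global__ void mv_naive(const float *A, const float *B, float *C, int h, int w)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // row of A, element of C
    if (idx >= h) return;

    float sum = 0.0f;
    for (int i = 0; i < w; i++)
        sum += A[idx * w + i] * B[i];
    C[idx] = sum;
}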

Partition camping: If the width of A is a multiple of the partition size, all thread blocks start reading from the same memory partition. [Figure: thread blocks tb0 and tb1 both starting their rows in partition 0, leaving partitions 1 and 2 idle.] 29

Eliminating partition camping: Let different thread blocks have different starting points. [Figure: thread blocks tb0, tb1, and tb2 starting in partitions 0, 1, and 2, respectively.] 30

Code to eliminate partition camping: int start = (blockIdx.x*16); // different start points for different thread blocks for (i=0; i<w; i=(i+16)) { int k=((start+i)%w); for (j=0; j<16; j=j+1) { float a; float b; a = A[idx][k+j]; b = B[k+j]; sum += a*b; } } C[idx]=sum; The un-optimized kernel is used here to illustrate how to remove partition camping; the optimized kernel has 32 threads in one thread block and uses shared memory to avoid un-coalesced memory accesses. A self-contained sketch follows. 31
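A self-contained sketch combining the staggered start with the full dot product, assuming a row-major h x w matrix with w a multiple of 16; it leaves out the shared-memory coalescing optimization mentioned on the slide, so it corresponds to the illustrative un-optimized kernel, and the function name is made up.

// Matrix-vector multiplication with staggered start columns to avoid
// partition camping. One thread computes one element C[idx].
__global__ void mv_stagger(const float *A, const float *B, float *C, int h, int w)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= h) return;

    int start = (blockIdx.x * 16) % w;   // different start point per thread block
    float sum = 0.0f;
    for (int i = 0; i < w; i += 16) {
        int k = (start + i) % w;         // wrap around so every column is visited once
        for (int j = 0; j < 16; j++)
            sum += A[idx * w + (k + j)] * B[k + j];
    }
    C[idx] = sum;
}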

[Chart: matrix-vector multiplication performance (GFLOPS) on GTX 280 for the Naïve, Opti_PC, Optimized, and CUBLAS 2.2 kernels, over matrix sizes 2kx2k, 2kx4k, 2kx8k, 2kx16k, 2kx32k, 2kx64k, 4kx4k, and 3kx3k.] Opti_PC: the optimized kernel without partition camping elimination. 32

[Chart: matrix-vector multiplication performance (Gflops) on GTX 480 for the Opti_PC, Optimized, and CUBLAS 3.1 kernels, over matrix sizes from 1K x 1K to 8K x 8K.] Opti_PC: the optimized kernel without partition camping elimination. Partition camping elimination benefits the 3K and 6K sizes most because the GTX 480 has 6 partitions. 33

Compiling for High Performance GPGPU Code: http://code.google.com/p/gpgpucompiler/ Naïve code → vectorization for memory access bandwidth → checking memory coalescing → converting non-coalesced accesses into coalesced ones → checking data dependencies and sharing patterns → thread & thread-block merge → data prefetching → removing memory partition camping → high performance code. 34

Outline: Hardware abstraction; a systematic approach to developing high performance GPGPU programs; optimization techniques with case studies (coalesced memory accesses, data reuse through thread (block) merge, eliminating partition conflicts, leveraging constant cache); conclusions. 35

Leveraging constant cache (GTX 480). Registers: benefit: fastest, no latency for the ALU; limitation: no sharing between threads. Constant cache: benefit: up to 2 TBytes/s; limitation: 64 KB of constant memory on GTX 480, and fast only for broadcast access where all threads read the same address. Shared memory: benefit: sharing within a block with arbitrary indexing; limitation: up to 1 TBytes/s. Texture cache: benefit: automatic 2D caching; limitation: up to 334 GBytes/s. Example: for r0 = r1 + r2*shared[k], each MAD reads one float from shared memory, so in one second 1 TB / 4 bytes = 0.25T floats can be read, and 0.25T * 2 flops = 500 Gflops is the ceiling imposed by shared memory bandwidth. 36

Case study: Matrix Multiplication with Constant Memory. C = A * B. float sum = 0; for (int i=0; i<w; i++) sum += A[idy][i]*B[i][idx]; C[idy][idx] = sum; Naïve matrix multiplication (one output per thread). All threads with the same idy access the same locations of input A in sequence, from A[idy][0] to A[idy][w-1], which is exactly the broadcast pattern the constant cache serves well. How about putting A into constant memory? A sketch of the host-side setup is given below. 37
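A minimal host-side sketch of how A could be placed in constant memory, assuming the tile of A fits in the 64 KB constant memory (e.g., the 128 x 16 tile used later, only 8 KB of floats); the symbol name constA and the sizes are illustrative, not taken from the authors' code.

#include <cuda_runtime.h>

// Constant-memory copy of (a tile of) A; 128 x 16 floats = 8 KB, well under 64 KB.
__constant__ float constA[128 * 16];

void upload_A(const float *hostA)
{
    // Copy the tile of A into constant memory before launching the kernel.
    cudaMemcpyToSymbol(constA, hostA, 128 * 16 * sizeof(float));
}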

Matrix Multiplication (Tiled). [Figure: a 4 x 3 matrix A, a 3-element vector B, and the 4-element result C.] C[i] = A[i][0]*B[0] + A[i][1]*B[1] + A[i][2]*B[2]; C[0], C[1], C[2], and C[3] can be computed concurrently. Let's put A[i][j] into constant memory. 38

Efficient constant memory accesses. [Figure: the 128 x 16 matrix A, used transposed (A^T), multiplied with B (WidthOfB columns, 128 rows) to produce C (WidthOfC columns, 16 rows).] When we load one pixel from B, we can compute one output pixel, or two, up to 16, so that we can use more computation to overlap the memory access to B. But column access in constant memory is not efficient. 39

Matrix Multiplication. One thread: load one pixel from B, load the matching 16 values of A (one column of A^T), and accumulate into one column of C. A is 128 x 16, so we can put A into constant memory (column major). For each float loaded from B we can do 16 MADs, which overlaps the memory requests to B with computation. The width of B determines the overall thread count. 40

Kernel code when A is 128 x 16: int idx = blockIdx.x*blockDim.x + threadIdx.x; float sum[16] = {0}; for (int i=0; i<128; i++) { float b = B[i][idx]; for (int j=0; j<16; j++) { sum[j] += b*constA[i*16+j]; // A is in constant memory } } for (int j=0; j<16; j++) { C[j][idx] = sum[j]; } Each thread computes 16 output pixels. Thread block size: 256. A runnable version follows. 41
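For completeness, a runnable version of this kernel under the same assumptions as the constant-memory sketch above (constA holds the 128 x 16 tile of A, while B and C are flat row-major device arrays with widthB columns); the function name, flat indexing, and launch configuration are illustrative, though the 256-thread blocks mirror the slide.

__constant__ float constA[128 * 16];   // same symbol as above; declared once per program

// Each thread produces 16 outputs: one column of the 16 x widthB result C.
__global__ void matmul_constA(const float *B, float *C, int widthB)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // column index into B and C
    if (idx >= widthB) return;

    float sum[16] = {0.0f};
    for (int i = 0; i < 128; i++) {
        float b = B[i * widthB + idx];                 // one global load ...
        for (int j = 0; j < 16; j++)
            sum[j] += b * constA[i * 16 + j];          // ... overlapped by 16 MADs
    }
    for (int j = 0; j < 16; j++)
        C[j * widthB + idx] = sum[j];
}

// Launch example: matmul_constA<<<(widthB + 255) / 256, 256>>>(dB, dC, widthB);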

[Chart: matrix multiplication performance (Gflops) on GTX 480 with A fixed at 128 x 16, comparing CUBLAS 3.1 against the constant-memory version for widths of B from 8192 to 1048576; constant memory transpose and transfer time is included.] Up to 1.8x speedup over CUBLAS 3.1; 75% of the theoretical computation power (1.35 Tflops) of the GTX 480. 42

[Chart: matrix multiplication performance (Gflops) on GTX 480, comparing CUBLAS 3.1 against the constant-memory version for square inputs from 2048 to 8192, where the width and height of A and B equal the input size; constant memory transpose and transfer time is included.] Up to 1.65x speedup over CUBLAS 3.1; 67% of the theoretical computation power (1.35 Tflops) of the GTX 480. 43

Conclusion: A systematic way to optimize GPGPU programs: start from a naïve kernel based on a simplified hardware abstraction, then apply optimizations: coalesced memory accesses, data reuse through thread (block) merge, eliminating partition conflicts, and leveraging different types of caches. We implemented a source-to-source compiler to perform the optimizations automatically: http://code.google.com/p/gpgpucompiler/ 44