ECE 498AL. Lecture 12: Application Lessons. When the tires hit the road


Objective

Putting the CUDA performance knowledge to work:
- Plausible strategies may or may not lead to performance enhancement.
- Different constraints dominate in different application situations.
- Case studies help to establish intuition, idioms, and ideas.
- Algorithm patterns can result in both better efficiency and better hardware utilization.

This lecture covers simple case studies on useful strategies for tuning CUDA application performance on the G80.

Some Performance Lessons from Matrix Multiplication

Multiply Using Several Blocks
- One block computes one square sub-matrix Psub of size BLOCK_WIDTH x BLOCK_WIDTH.
- One thread computes one element of Psub (a kernel sketch follows below).
- Assume that the dimensions of M and N are multiples of BLOCK_WIDTH and square in shape.

[Figure: M, N, and P partitioned into BLOCK_WIDTH x BLOCK_WIDTH tiles; block indices (bx, by) and thread indices (tx, ty) select the Psub tile and the element within it.]
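The following is a minimal sketch of this scheme (not the course's reference kernel): each block stages one BLOCK_WIDTH x BLOCK_WIDTH tile of M and one of N in shared memory per iteration, and each thread accumulates one element of Psub. The kernel name, the BLOCK_WIDTH value, and the square Width x Width layout are illustrative assumptions.

    #define BLOCK_WIDTH 16  // assumed tile size for illustration

    // Launch with dim3 block(BLOCK_WIDTH, BLOCK_WIDTH) and a
    // (Width / BLOCK_WIDTH) x (Width / BLOCK_WIDTH) grid.
    __global__ void MatrixMulTiled(const float* M, const float* N, float* P, int Width)
    {
        __shared__ float Ms[BLOCK_WIDTH][BLOCK_WIDTH];
        __shared__ float Ns[BLOCK_WIDTH][BLOCK_WIDTH];

        int tx = threadIdx.x, ty = threadIdx.y;
        int Row = blockIdx.y * BLOCK_WIDTH + ty;
        int Col = blockIdx.x * BLOCK_WIDTH + tx;

        float Pvalue = 0.0f;
        for (int m = 0; m < Width / BLOCK_WIDTH; ++m) {
            // Each thread loads one element of the M tile and one of the N tile
            Ms[ty][tx] = M[Row * Width + (m * BLOCK_WIDTH + tx)];
            Ns[ty][tx] = N[(m * BLOCK_WIDTH + ty) * Width + Col];
            __syncthreads();

            for (int k = 0; k < BLOCK_WIDTH; ++k)
                Pvalue += Ms[ty][k] * Ns[k][tx];
            __syncthreads();
        }
        P[Row * Width + Col] = Pvalue;
    }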

First-order Size Considerations
- Each thread block should have a minimum of 96 (768/8) threads.
- A tile width (BLOCK_WIDTH) of 16 gives 16*16 = 256 threads per block.
- A minimum of 64 thread blocks is needed; a 1024*1024 P matrix at BLOCK_WIDTH = 16 gives 64*64 = 4,096 thread blocks.
- Each thread block performs 2*256 = 512 float loads from device memory for 256 * (2*16) = 8,192 mul/add operations.

Shared Memory Usage
- Each SM has 16 KB of shared memory.
- Each thread block uses 2*256*4 B = 2 KB of shared memory, so up to 8 thread blocks can potentially be executing at once.
- For BLOCK_WIDTH = 16, this allows up to 8*512 = 4,096 pending loads; in practice there will probably be about half of this, because scheduling must also keep the SPs busy.
- The next size, BLOCK_WIDTH = 32, would use 2*32*32*4 B = 8 KB of shared memory per thread block, allowing only up to two thread blocks to be active at the same time.

Instruction Mix Considerations

    for (int k = 0; k < BLOCK_SIZE; ++k)
        Pvalue += Ms[ty][k] * Ns[k][tx];

There are very few mul/add instructions between the branch and the address calculations. Loop unrolling can help:

    Pvalue += Ms[ty][k] * Ns[k][tx] + ... + Ms[ty][k+15] * Ns[k+15][tx];

More Work per Thread: 1xN Tiling
- One block computes N square sub-matrices Psub of size BLOCK_WIDTH x BLOCK_WIDTH.
- One thread computes N elements of Psub (see the sketch below).
- Reduced loads from global memory (M) to shared memory.
- Reduced instruction overhead: more work is done in each iteration.

[Figure: one tile of M reused against N tiles of N to produce N Psub tiles of P.]
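Below is a minimal sketch of the 1xN idea for N = 2 (hypothetical code, not the lecture's reference kernel): each block loads one tile of M into shared memory and reuses it against two tiles of N, so each thread produces two elements of P with half as many M loads per output element.

    #define BLOCK_WIDTH 16  // assumed tile size, as in the earlier sketch

    // Launch with dim3 block(BLOCK_WIDTH, BLOCK_WIDTH) and a
    // (Width / (2*BLOCK_WIDTH)) x (Width / BLOCK_WIDTH) grid.
    __global__ void MatrixMul1x2(const float* M, const float* N, float* P, int Width)
    {
        __shared__ float Ms[BLOCK_WIDTH][BLOCK_WIDTH];
        __shared__ float Ns[BLOCK_WIDTH][2 * BLOCK_WIDTH];

        int tx = threadIdx.x, ty = threadIdx.y;
        int Row  = blockIdx.y * BLOCK_WIDTH + ty;
        int Col0 = blockIdx.x * (2 * BLOCK_WIDTH) + tx;  // first output tile
        int Col1 = Col0 + BLOCK_WIDTH;                   // second output tile

        float Pvalue0 = 0.0f, Pvalue1 = 0.0f;
        for (int m = 0; m < Width / BLOCK_WIDTH; ++m) {
            // One M tile is loaded once and reused for both output tiles
            Ms[ty][tx] = M[Row * Width + (m * BLOCK_WIDTH + tx)];
            Ns[ty][tx]               = N[(m * BLOCK_WIDTH + ty) * Width + Col0];
            Ns[ty][tx + BLOCK_WIDTH] = N[(m * BLOCK_WIDTH + ty) * Width + Col1];
            __syncthreads();

            for (int k = 0; k < BLOCK_WIDTH; ++k) {
                Pvalue0 += Ms[ty][k] * Ns[k][tx];
                Pvalue1 += Ms[ty][k] * Ns[k][tx + BLOCK_WIDTH];
            }
            __syncthreads();
        }
        P[Row * Width + Col0] = Pvalue0;
        P[Row * Width + Col1] = Pvalue1;
    }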

Prefetching

One could double-buffer the computation to get a better instruction mix within each thread. This is classic software pipelining from ILP compilers.

Original loop:

    Loop {
        Load current tile to shared memory
        __syncthreads()
        Compute current tile
        __syncthreads()
    }

Prefetching version:

    Load next tile from global memory
    Loop {
        Deposit current tile to shared memory
        __syncthreads()
        Load next tile from global memory
        Compute current tile
        __syncthreads()
    }

Some More Plausible Ideas
- One might be able to use texture memory for M accesses to reduce register usage.
- Let us know if you get more than 120 GFLOPS (including CPU/GPU data transfers) for matrix multiplication.
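A hedged sketch of what the prefetching version might look like inside the tiled kernel sketched earlier. It assumes the same Ms/Ns shared tiles, Row/Col indices, Pvalue accumulator, and BLOCK_WIDTH, and replaces that kernel's tile loop; the single-element register buffers mReg/nReg are illustrative names.

    // Prefetch the first tile's elements into registers
    float mReg = M[Row * Width + tx];
    float nReg = N[ty * Width + Col];

    for (int m = 0; m < Width / BLOCK_WIDTH; ++m) {
        // Deposit the prefetched elements into shared memory
        Ms[ty][tx] = mReg;
        Ns[ty][tx] = nReg;
        __syncthreads();

        // Load the next tile's elements while computing on the current tile
        if (m + 1 < Width / BLOCK_WIDTH) {
            mReg = M[Row * Width + ((m + 1) * BLOCK_WIDTH + tx)];
            nReg = N[((m + 1) * BLOCK_WIDTH + ty) * Width + Col];
        }

        for (int k = 0; k < BLOCK_WIDTH; ++k)
            Pvalue += Ms[ty][k] * Ns[k][tx];
        __syncthreads();
    }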

Scan Algorithm Effects on Parallelism and Memory Conflicts

Parallel Prefix Sum (Scan)

Definition: the all-prefix-sums operation takes a binary associative operator ⊕ with identity I and an array of n elements

    [a_0, a_1, ..., a_{n-1}]

and returns the ordered set

    [I, a_0, (a_0 ⊕ a_1), ..., (a_0 ⊕ a_1 ⊕ ... ⊕ a_{n-2})].

Example: if ⊕ is addition, then scan on the array [3 1 7 0 4 1 6 3] returns [0 3 4 11 11 15 16 22].

This is an exclusive scan: the last input element is not included in the result. (From Blelloch, 1990, "Prefix Sums and Their Applications".)

Applications of Scan

Scan is a simple and useful parallel building block. It converts recurrences from the sequential form

    for (j = 1; j < n; j++)
        out[j] = out[j-1] + f(j);

into the parallel form

    forall (j) { temp[j] = f(j); }
    scan(out, temp);

It is useful for many parallel algorithms: radix sort, quicksort, string comparison, lexical analysis, stream compaction, polynomial evaluation, solving recurrences, tree operations, histograms, etc.

Scan on the CPU

    void scan(float* scanned, float* input, int length)
    {
        scanned[0] = 0;
        for (int i = 1; i < length; ++i)
            scanned[i] = input[i-1] + scanned[i-1];
    }

Just add each element to the sum of the elements before it. Trivial, but sequential. Exactly n adds: optimal in terms of work efficiency.
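A small hypothetical test harness (not part of the lecture) that checks the CPU version against the example from the scan definition above:

    #include <cstdio>

    void scan(float* scanned, float* input, int length)
    {
        scanned[0] = 0;
        for (int i = 1; i < length; ++i)
            scanned[i] = input[i-1] + scanned[i-1];
    }

    int main()
    {
        float in[8] = {3, 1, 7, 0, 4, 1, 6, 3};
        float out[8];
        scan(out, in, 8);
        for (int i = 0; i < 8; ++i)
            printf("%g ", out[i]);   // prints: 0 3 4 11 11 15 16 22
        printf("\n");
        return 0;
    }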

A First-Attempt Parallel Scan Algorithm

Step 1: Read input from device memory to shared memory. Set the first element to zero and shift the others right by one.

    In: 3 1 7 0 4 1 6 3
    T0: 0 3 1 7 0 4 1 6

Each thread reads one value from the input array in device memory into shared memory array T0; thread 0 writes 0 into shared memory.

Step 2: Iterate log(n) times. Threads stride to n: add pairs of elements stride elements apart, and double the stride at each iteration. (Note: the shared memory arrays must be double buffered.)

Iteration 1, stride = 1. Active threads: stride to n-1 (n - stride threads). Thread j adds elements j and j - stride from T0 and writes the result into shared memory buffer T1 (ping-pong).

    T0: 0 3 1 7 0 4 1 6
    T1: 0 3 4 8 7 4 5 7

Iteration 2, stride = 2 (reading T1, writing T0):

    T1: 0 3 4 8  7  4  5  7
    T0: 0 3 4 11 11 12 12 11

Iteration 3, stride = 4 (reading T0, writing T1):

    T0: 0 3 4 11 11 12 12 11
    T1: 0 3 4 11 11 15 16 22

Step 3: Write output to device memory.

    T1:  0 3 4 11 11 15 16 22
    Out: 0 3 4 11 11 15 16 22

Code from the CUDA SDK (naïve scan), launched with n threads:

    __global__ void scan_naive(float *g_odata, float *g_idata, int n)
    {
        // Dynamically allocated shared memory for scan kernels
        extern __shared__ float temp[];

        int thid = threadIdx.x;
        int pout = 0;
        int pin  = 1;

        // Cache the computational window in shared memory
        temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;

        for (int offset = 1; offset < n; offset *= 2)
        {
            // Ping-pong between the two halves of temp[]
            pout = 1 - pout;
            pin  = 1 - pout;
            __syncthreads();

            temp[pout*n + thid] = temp[pin*n + thid];
            if (thid >= offset)
                temp[pout*n + thid] += temp[pin*n + thid - offset];
        }

        __syncthreads();
        g_odata[thid] = temp[pout*n + thid];
    }
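A hedged sketch of how the naïve kernel might be launched for a single block (illustrative sizes, not from the SDK sample itself; assumes n is a power of two no larger than the maximum threads per block): n threads and 2*n floats of dynamic shared memory for the two ping-pong buffers.

    // Illustrative host-side launch for scan_naive
    int n = 512;
    float h_in[512], h_out[512];
    // ... fill h_in with input values ...

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // One block, n threads, 2*n floats of shared memory (double-buffered window)
    scan_naive<<<1, n, 2 * n * sizeof(float)>>>(d_out, d_in, n);

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);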

Work Efficiency Considerations

The first-attempt scan executes log(n) parallel iterations. The steps perform (n-1), (n-2), (n-4), ..., (n - n/2) adds, for a total of n*(log(n) - 1) + 1 adds: O(n*log(n)) work. This scan algorithm is not very work efficient: the sequential scan algorithm does only n adds, so the extra factor of log(n) hurts (about 20x for 10^6 elements). For example, for n = 8 the parallel version performs 7 + 6 + 4 = 17 adds where the sequential scan needs 7. A parallel algorithm can be slow when execution resources are saturated due to low work efficiency.

Improving Efficiency

A common parallel algorithm pattern: balanced trees. Build a balanced binary tree on the input data and sweep it to and from the root. The tree is not an actual data structure, but a concept used to determine what each thread does at each step. For scan:
- Traverse from the leaves to the root, building partial sums at the internal nodes of the tree; the root holds the sum of all leaves.
- Traverse back from the root to the leaves, building the scan from the partial sums.

Build the Sum Tree

Assume the array is already in shared memory:

    T: 3 1 7 0 4 1 6 3

Iteration 1, stride = 1, n/2 threads:

    T: 3 4 7 7 4 5 6 9

Each add corresponds to a single thread. Iterate log(n) times; at each iteration a thread adds the value stride elements away to its own value.

Iteration 2, stride = 2, n/4 threads:

    T: 3 4 7 11 4 5 6 14

Iteration log(n), stride = 4, 1 thread:

    T: 3 4 7 11 4 5 6 25

Each add corresponds to a single thread. Note that this algorithm operates in place: no double buffering is needed.

Zero the Last Element

    T: 3 4 7 11 4 5 6 0

We now have an array of partial sums. Since this is an exclusive scan, set the last element to zero; it will propagate back to the first element.

Build Scan From Partial Sums

    T: 3 4 7 11 4 5 6 0

Iteration 1, stride = 4, 1 thread:

    T: 3 4 7 0 4 5 6 11

Iteration 2, stride = 2, 2 threads:

    T: 3 0 7 4 4 11 6 16

Each step corresponds to a single thread. Iterate log(n) times; at each iteration a thread adds the value stride elements away to its own value, and sets the value stride elements away to its own previous value.

Iteration log(n), stride = 1, n/2 threads:

    T: 0 3 4 11 11 15 16 22

Done! We now have a completed scan that we can write out to device memory. Total steps: 2 * log(n). Total work: 2 * (n-1) adds = O(n). Work efficient!

Code (CUDA SDK work-efficient scan), launched with n/2 threads:

    __global__ void scan_workefficient(float *g_odata, float *g_idata, int n)
    {
        // Dynamically allocated shared memory for scan kernels
        extern __shared__ float temp[];

        int thid = threadIdx.x;
        int offset = 1;

        // Cache the computational window in shared memory
        temp[2*thid]   = g_idata[2*thid];
        temp[2*thid+1] = g_idata[2*thid+1];

        // Build the sum in place up the tree
        for (int d = n>>1; d > 0; d >>= 1)
        {
            __syncthreads();
            if (thid < d)
            {
                int ai = offset*(2*thid+1) - 1;
                int bi = offset*(2*thid+2) - 1;
                temp[bi] += temp[ai];
            }
            offset *= 2;
        }

        // Scan back down the tree

        // Clear the last element
        if (thid == 0)
            temp[n - 1] = 0;

        // Traverse down the tree building the scan in place
        for (int d = 1; d < n; d *= 2)
        {
            offset >>= 1;
            __syncthreads();
            if (thid < d)
            {
                int ai = offset*(2*thid+1) - 1;
                int bi = offset*(2*thid+2) - 1;
                float t  = temp[ai];
                temp[ai] = temp[bi];
                temp[bi] += t;
            }
        }

        __syncthreads();

        // Write results to global memory
        g_odata[2*thid]   = temp[2*thid];
        g_odata[2*thid+1] = temp[2*thid+1];
    }

Problem: Bank Conflicts

Each thread loads two shared memory elements, and it is tempting to interleave the loads:

    temp[2*thid]   = g_idata[2*thid];
    temp[2*thid+1] = g_idata[2*thid+1];

With 16 banks, threads (0, 1, 2, ..., 8, 9, 10, ...) then hit banks (0, 2, 4, ..., 0, 2, 4, ...): each even bank is accessed by two threads of a half-warp, giving two-way bank conflicts.

Solution: padding. See the best implementation in the CUDA SDK.
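The padding idea, roughly as it appears in the SDK sample and the GPU Gems 3 scan chapter (the constants and macro below are a sketch for G80's 16 banks, not a verbatim copy), offsets every shared-memory index by index/16 so that elements that would have fallen into the same bank are spread apart:

    #define NUM_BANKS 16
    #define LOG_NUM_BANKS 4
    // One extra element of padding for every NUM_BANKS elements of index
    #define CONFLICT_FREE_OFFSET(idx) ((idx) >> LOG_NUM_BANKS)

    // Inside a kernel like scan_workefficient: load with padded indices.
    // The shared array temp[] must be allocated with room for the padding.
    int ai = thid;            // first element handled by this thread
    int bi = thid + (n / 2);  // second element handled by this thread
    temp[ai + CONFLICT_FREE_OFFSET(ai)] = g_idata[ai];
    temp[bi + CONFLICT_FREE_OFFSET(bi)] = g_idata[bi];

The tree-traversal indices ai and bi in the up-sweep and down-sweep get the same treatment, which is essentially what the SDK's best scan kernel does.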