Parallel Patterns Ezio Bartocci


TECHNISCHE UNIVERSITÄT WIEN Fakultät für Informatik Cyber-Physical Systems Group Parallel Patterns Ezio Bartocci

Parallel Patterns Think at a higher level than individual CUDA kernels. Specify what to compute, not how to compute it. Let the programmer worry about the algorithm; defer the pattern implementation to someone else.

Common Parallel Computing Scenarios Many parallel threads need to generate a single result → Reduce. Many parallel threads need to partition data → Split. Many parallel threads produce variable output per thread → Compact / Expand.

Reduction Reduce a vector to a single value via an associative operator (+, *, min/max, AND/OR, ...). CPU: sequential implementation, for(int i = 0; i < n; ++i) ... GPU: tree-based implementation:

3 1 7 0 4 1 6 3
 4   7   5   9
   11      14
       25

Serial Reduction

// reduction via serial iteration
float sum(float *data, int n)
{
  float result = 0;
  for(int i = 0; i < n; ++i)
  {
    result += data[i];
  }
  return result;
}

Parallel Reduction - Interleaved Values (in shared memory)

Initial values: 10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2
Step 1 (stride 1, thread IDs 0-7):
                11  1  7 -1 -2 -2  8  5 -5 -3  9  7 11 11  2  2
Step 2 (stride 2, thread IDs 0-3):
                18  1  7 -1  6 -2  8  5  4 -3  9  7 13 11  2  2
Step 3 (stride 4, thread IDs 0-1):
                24  1  7 -1  6 -2  8  5 17 -3  9  7 13 11  2  2
Step 4 (stride 8, thread ID 0):
                41  1  7 -1  6 -2  8  5 17 -3  9  7 13 11  2  2

Parallel Reduction - Contiguous Values (in shared memory)

Initial values: 10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2
Step 1 (stride 8, thread IDs 0-7):
                 8 -2 10  6  0  9  3  7 -2 -3  2  7  0 11  0  2
Step 2 (stride 4, thread IDs 0-3):
                 8  7 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2
Step 3 (stride 2, thread IDs 0-1):
                21 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2
Step 4 (stride 1, thread ID 0):
                41 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2
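The two diagrams correspond to two different loop shapes inside the block-level reduction. A minimal sketch of both (my illustration, not the lecture's code; sdata is the shared-memory array used by the kernels below):

// interleaved addressing: stride grows; active threads index elements
// 2*stride apart, which causes shared-memory bank conflicts
for(int stride = 1; stride < blockDim.x; stride *= 2)
{
  int index = 2 * stride * threadIdx.x;
  if(index < blockDim.x)
    sdata[index] += sdata[index + stride];
  __syncthreads();
}

// contiguous (sequential) addressing: stride shrinks; active threads stay
// packed at the front, so accesses are conflict-free and whole warps
// retire early
for(int stride = blockDim.x / 2; stride > 0; stride /= 2)
{
  if(threadIdx.x < stride)
    sdata[threadIdx.x] += sdata[threadIdx.x + stride];
  __syncthreads();
}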

CUDA Reduction

__global__ void block_sum(float *input, float *results, size_t n)
{
  extern __shared__ float sdata[];
  int i = ..., tx = threadIdx.x;
  // load input into shared memory
  float x = 0;
  if(i < n)
    x = input[i];
  sdata[tx] = x;
  __syncthreads();

CUDA Reduction

  // block-wide reduction in shared mem
  for(int offset = blockDim.x / 2; offset > 0; offset >>= 1)
  {
    if(tx < offset)
    {
      // add a partial sum upstream to our own
      sdata[tx] += sdata[tx + offset];
    }
    __syncthreads();
  }

CUDA Reduction

  // finally, thread 0 writes the result
  if(threadIdx.x == 0)
  {
    // note that the result is per-block,
    // not per-thread
    results[blockIdx.x] = sdata[0];
  }
}

An Aside

// is this barrier divergent?
for(int offset = blockDim.x / 2; offset > 0; offset >>= 1)
{
  ...
  __syncthreads();
}

No: every thread in the block executes the loop the same number of times, so all threads reach each __syncthreads() together.

An Aside

// what about this one?
__global__ void do_i_halt(int *input)
{
  int i = ...
  if(input[i])
  {
    ...
    __syncthreads(); // a divergent barrier
                     // hangs the machine
  }
}

CUDA Reduction

// global sum via per-block reductions
float sum(float *d_input, size_t n)
{
  size_t block_size = ..., num_blocks = ...;
  // allocate per-block partial sums
  // plus a final total sum
  float *d_sums = 0;
  cudaMalloc((void**)&d_sums, sizeof(float) * (num_blocks + 1));
  ...

CUDA Reduction

  // reduce per-block partial sums
  int smem_sz = block_size * sizeof(float);
  block_sum<<<num_blocks, block_size, smem_sz>>>(d_input, d_sums, n);
  // reduce partial sums to a total sum
  block_sum<<<1, block_size, smem_sz>>>(d_sums, d_sums + num_blocks, num_blocks);
  // copy result to host
  float result = 0;
  cudaMemcpy(&result, d_sums + num_blocks, ...);
  return result;
}

Caveat Reductor What happens if there are too many partial sums to fit into shared memory in the second stage? What happens if the temporary storage is too big? Give each thread more work in the first stage: the sum is associative & commutative, so order does not matter to the result and we can schedule the sum any way we want → serial accumulation before the block-wide reduction. Exercise left to the hacker; a sketch follows.
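One possible shape of that first stage (a sketch under assumed names, not the lecture's code): launch a fixed-size grid and let each thread accumulate serially with a grid-stride loop before the tree reduction.

__global__ void block_sum_v2(float *input, float *results, size_t n)
{
  extern __shared__ float sdata[];
  // serial accumulation: each thread strides through the input, so a
  // fixed number of blocks handles any n and the per-block partial
  // sums always fit in the second stage
  float x = 0;
  for(size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
      i += (size_t)gridDim.x * blockDim.x)
    x += input[i];
  sdata[threadIdx.x] = x;
  __syncthreads();
  // ... block-wide tree reduction and per-block write-out as before ...
}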

Parallel Reduction Complexity log(N) parallel steps; each step S does N/2^S independent ops. Step complexity is O(log N). For N = 2^D, the algorithm performs sum_{S=1..D} 2^(D-S) = N - 1 operations, so work complexity is O(N): it is work-efficient, i.e., it does not perform more operations than a sequential algorithm. With P threads physically in parallel (P processors), time complexity is O(N/P + log N); compare to O(N) for a sequential reduction.
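Spelling the operation count out as a geometric series (a restatement of the slide's formula in LaTeX):

\sum_{S=1}^{D} \frac{N}{2^{S}} = \sum_{S=1}^{D} 2^{D-S}
  = 2^{D-1} + 2^{D-2} + \dots + 2 + 1 = 2^{D} - 1 = N - 1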

Split Operation

Given: an array of true and false elements (and payloads)
Return an array with all true elements at the beginning
Examples: sorting, building trees

Flag:    F T F F T F F T
Payload: 3 6 1 4 0 7 1 3

After split:
Flag:    F F F F F T T T
Payload: 3 1 4 7 1 6 0 3

Variable Output Per Thread: Compact

Remove null elements:
Input:  3 0 7 0 4 1 0 3
Output: 3 7 4 1 3

Example: collision detection

Variable Output Per Thread: General Case

Reserve variable storage per thread:
Elements per thread: 2   1   0  3     2
Reserved storage:    A B C      D E F G H

Example: binning

Split, Compact, Expand Each thread must answer a simple question: Where do I write my output? The answer depends on what other threads write! Scan provides an efficient parallel answer
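For example, compaction becomes a simple scatter once the keep-flags have been exclusively scanned. A minimal sketch (my illustration with assumed names, not the lecture's code):

// flags[i] is 1 if element i survives, 0 otherwise;
// positions[] holds the exclusive scan of flags[]
__global__ void compact(const int *in, const int *flags,
                        const int *positions, int *out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if(i < n && flags[i])
    out[positions[i]] = in[i]; // each thread knows where to write
}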

Scan (a.k.a. Parallel Prefix Sum)

Given an array A = [a_0, a_1, ..., a_{n-1}] and a binary associative operator + with identity I:

scan(A) = [I, a_0, (a_0 + a_1), ..., (a_0 + a_1 + ... + a_{n-2})]

Prefix sum: if + is addition, then scan on the series 3 1 7 0 4 1 6 3 returns the series 0 3 4 11 11 15 16 22.

Applications of Scan Scan is a simple and useful parallel building block for many parallel algorithms: radix sort, quicksort (via segmented scan), string comparison, lexical analysis, stream compaction, run-length encoding, polynomial evaluation, solving recurrences, tree operations, histograms, allocation, etc. Fascinating, since scan is unnecessary in sequential computing!

Serial

int input[8] = {3, 1, 7, 0, 4, 1, 6, 3};
int result[8];
int running_sum = 0;
for(int i = 0; i < 8; ++i)
{
  result[i] = running_sum;
  running_sum += input[i];
}
// result = {0, 3, 4, 11, 11, 15, 16, 22}

A Scan Algorithm - Preview 3 1 7 0 4 1 6 3 Assume the array is already in shared memory. See Harris, M., S. Sengupta, and J.D. Owens, "Parallel Prefix Sum (Scan) with CUDA," GPU Gems 3.

A Scan Algorithm - Preview

3 1 7 0 4 1 6 3
3 4 8 7 4 5 7 9

Iteration 0, n-1 threads. Each element corresponds to a single thread. Iterate log(n) times; each thread adds the value stride elements away to its own value.

A Scan Algorithm - Preview

3 1  7  0  4  1  6  3
3 4  8  7  4  5  7  9
3 4 11 11 12 12 11 14

Iteration 1, n-2 threads. Each element corresponds to a single thread. Iterate log(n) times; each thread adds the value offset elements away to its own value.

A Scan Algorithm - Preview

3 1  7  0  4  1  6  3
3 4  8  7  4  5  7  9
3 4 11 11 12 12 11 14
3 4 11 11 15 16 22 25

Iteration i, n-2^i threads. Each element corresponds to a single thread. Iterate log(n) times; each thread adds the value offset elements away to its own value. Note that this algorithm operates in place: no need for double buffering.

A Scan Algorithm - Preview 3 4 11 11 15 16 22 25 We have an inclusive scan result

A Scan Algorithm - Preview

Inclusive result: 3 4 11 11 15 16 22 25
Shifted right:    0 3  4 11 11 15 16 22   (carry: 25)

For an exclusive scan, right-shift the result through shared memory, shifting in the identity 0. Note that the unused final element (25) is also the sum of the entire array, often called the carry: scan & reduce in one pass.
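Inside the block, the shift is just one more read/sync/write round on the shared array. A minimal sketch (my illustration, under the same assumptions as the kernel below):

// turn the inclusive result in sdata[] into an exclusive one
int carry = sdata[blockDim.x - 1];  // sum of the whole block
int left = (threadIdx.x == 0) ? 0 : sdata[threadIdx.x - 1];
__syncthreads();
sdata[threadIdx.x] = left;  // identity 0 shifted in at position 0
// carry can then be written out as this block's total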

CUDA Block-wise Inclusive Scan

__global__ void inclusive_scan(int *data)
{
  extern __shared__ int sdata[];
  unsigned int i = ...
  // load input into shared memory
  int sum = data[i];
  sdata[threadIdx.x] = sum;
  __syncthreads();
  ...

CUDA Block-wise Inclusive Scan

  for(int o = 1; o < blockDim.x; o <<= 1)
  {
    if(threadIdx.x >= o)
      sum += sdata[threadIdx.x - o];
    // wait on reads
    __syncthreads();
    // write my partial sum
    sdata[threadIdx.x] = sum;
    // wait on writes
    __syncthreads();
  }

CUDA Block-wise Inclusive Scan

  // we're done!
  // each thread writes out its result
  data[i] = sdata[threadIdx.x];
}

Results are Local to Each Block

Block 0
Input:  5  5  4  4  5  4  0  0  4  2  5  5  1  3  1  5
Result: 5 10 14 18 23 27 27 27 31 33 38 43 44 47 48 53

Block 1
Input:  1  2  3  0  3  0  2  3  4  4  3  2  2  5  5  0
Result: 1  3  6  6  9  9 11 14 18 22 25 27 29 34 39 39

Results are Local to Each Block Need to propagate results from each block to all subsequent blocks: a 2-phase scan. 1. Per-block scan & reduce. 2. Scan the per-block sums. A final update propagates the phase-2 data and transforms the result into an exclusive scan; a sketch of this structure follows.
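What the host-side orchestration could look like (a sketch; the three kernel names are hypothetical stand-ins, not the lecture's code):

void global_scan(int *d_data, int *d_block_sums,
                 int n, int num_blocks, int block_size)
{
  size_t smem = block_size * sizeof(int);
  // phase 1: scan each block locally and emit each block's total
  block_scan_and_reduce<<<num_blocks, block_size, smem>>>(d_data, d_block_sums, n);
  // phase 2: scan the per-block totals (assume they fit in one block)
  scan_block_sums<<<1, block_size, smem>>>(d_block_sums, num_blocks);
  // final update: add each block's scanned offset to its local results
  add_block_offsets<<<num_blocks, block_size>>>(d_data, d_block_sums, n);
}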

Summing Up Patterns like reduce, split, compact, scan, and others let us reason about data parallel problems abstractly Higher level patterns are built from more fundamental patterns Scan in particular is fundamental to parallel processing, but unnecessary in a serial world

SEGMENTED SCAN

Segmented Scan What it is: scan + barriers/flags associated with certain positions in the input arrays. Operations don't propagate beyond barriers. Do many scans at once, no matter their size.

Image taken from "Efficient Parallel Scan Algorithms for GPUs" by S. Sengupta, M. Harris, and M. Garland.

Segmented Scan

__global__ void segscan(int *data, int *flags)
{
  __shared__ int s_data[BL_SIZE];
  __shared__ int s_flags[BL_SIZE];
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  // copy block of data into shared memory
  s_data[idx] = ...;
  s_flags[idx] = ...;
  __syncthreads();

Segmented Scan

  // choose whether to propagate
  s_data[idx] = s_flags[idx] ? s_data[idx]
                             : s_data[idx - 1] + s_data[idx];
  // create merged flag
  s_flags[idx] = s_flags[idx - 1] | s_flags[idx];
  // repeat for different strides
}
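One way to write out the stride loop the comment alludes to (a sketch assuming a single block and in-bounds indices; the slide elides these details):

for(int stride = 1; stride < blockDim.x; stride <<= 1)
{
  int prev_data = 0, prev_flag = 0;
  if(idx >= stride)
  {
    prev_data = s_data[idx - stride];
    prev_flag = s_flags[idx - stride];
  }
  __syncthreads();
  if(idx >= stride && !s_flags[idx])
  {
    s_data[idx] += prev_data;   // propagate only within a segment
    s_flags[idx] |= prev_flag;  // merge flags
  }
  __syncthreads();
}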

Segmented Scan Doing lots of reductions of unpredictable size at the same time is the most common use Think of doing sums/max/count/any over arbitrary subdomains of your data

Segmented Scan Common Usage Scenarios: Determine which region/tree/group/object class an element belongs to and assign that as its new ID Sort based on that ID Operate on all of the regions/trees/groups/objects in parallel, no matter what their size or number

Segmented Scan Also useful for implementing divide-and-conquer type algorithms Quicksort and similar algorithms

SORT

Sort Useful for almost everything Optimized versions for the GPU already exist Sorted lists can be processed by segmented scan Sort data to restore memory and execution coherence

Sort Binning and sorting can often be used interchangeably. Sort is standard but can be suboptimal; binning is usually custom, has to be optimized, and can be faster.

Sort Radix sort is faster than comparison-based sorts. If you can generate a fixed-size key for the attribute you want to sort on, you get better performance; see the sketch below.
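A classic instance of such a fixed-size key (my example, not from the lecture): map an IEEE-754 float to an unsigned int whose unsigned ordering matches the float ordering, then radix-sort the ints.

__device__ unsigned int float_to_key(float f)
{
  unsigned int u = __float_as_uint(f);
  // negative floats: flip all bits (reverses their order);
  // non-negative floats: set the sign bit (places them above negatives)
  return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}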

MAPREDUCE

Mapreduce An old concept from functional programming, repopularized by Google as a parallel computing pattern. A combination of sort and reduction (scan).

Mapreduce Image taken from Jeff Dean's presentation at http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0007.html

Mapreduce: Map Phase Map a function over a domain. The function is provided by the user and can be anything which produces a (key, value) pair. The value can just be a pointer to an arbitrary data structure.

Mapreduce: Sort Phase All the (key,value) pairs are sorted based on their keys Happens implicitly Creates runs of (k,v) pairs with same key User usually has no control over sort function

Mapreduce: Reduce Phase The reduce function is provided by the user and can be a simple plus, max, ... The library makes sure that values from one key don't propagate to another (segscan). The final result is a list of keys and final values (or arbitrary data structures).
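For a concrete feel, the whole map/sort/reduce pipeline can be expressed on the GPU with Thrust. This is an illustrative sketch of the pattern (the key function and data are made up, not the lecture's code):

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/fill.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

struct make_key
{
  __host__ __device__ int operator()(int x) const { return x % 10; }
};

void mapreduce_sketch(thrust::device_vector<int> &input)
{
  int n = input.size();
  thrust::device_vector<int> keys(n), values(n), out_keys(n), out_vals(n);
  // map: produce a (key, value) pair per element
  thrust::transform(input.begin(), input.end(), keys.begin(), make_key());
  thrust::fill(values.begin(), values.end(), 1);
  // sort: create runs of pairs with the same key
  thrust::sort_by_key(keys.begin(), keys.end(), values.begin());
  // reduce: sum values within each run, one output per distinct key
  thrust::reduce_by_key(keys.begin(), keys.end(), values.begin(),
                        out_keys.begin(), out_vals.begin());
}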

KERNEL FUSION

Kernel Fusion Combine kernels with simple producer->consumer dataflow Combine generic data movement kernel with specific operator function Save memory bandwidth by not writing out intermediate results to global memory

Separate Kernels

__global__ void is_even(int *in, int *out)
{
  int i = ...
  out[i] = ((in[i] % 2) == 0) ? 1 : 0;
}

__global__ void scan(...)
{
  ...
}

Fused Kernel

__global__ void fused_even_scan(int *in, int *out, ...)
{
  int i = ...
  int flag = ((in[i] % 2) == 0) ? 1 : 0;
  // your scan code here, using the flag directly
}

Fused Kernel Best when the pattern looks like output[i] = g(f(input[i])); Any simple one-to-one mapping will work

Fused Kernel

template <class F>
__global__ void opt_stencil(float *in, float *out, F f)
{
  // your 2D stencil code here
  for(i,j)
  {
    partial = f(partial, in[...], i, j);
  }
  float result = partial;
}

Fused Kernel

class boxfilter
{
private:
  float table[3][3];
public:
  boxfilter(float input[3][3]) { /* copy input into table */ }
  float operator()(float a, float b, int i, int j)
  {
    return a + b * table[i][j];
  }
};

Fused Kernel

class maxfilter
{
public:
  float operator()(float a, float b, int i, int j)
  {
    return max(a, b);
  }
};
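How the fused stencil might then be invoked with either functor (a sketch; grid, block, d_in, and d_out are assumed to be set up elsewhere):

float weights[3][3] = { {1,1,1}, {1,1,1}, {1,1,1} };  // example box weights
boxfilter box(weights);
maxfilter mx;
// same generic stencil kernel, two different fused operators
opt_stencil<<<grid, block>>>(d_in, d_out, box);
opt_stencil<<<grid, block>>>(d_in, d_out, mx);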

Example Segmented Scan

int data[10]  = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
int flags[10] = {0, 0, 0, 1, 0, 1, 1, 0, 0, 0};

int step1[10] = {1, 2, 1, 1, 1, 1, 1, 2, 1, 2};
int flg1[10]  = {0, 0, 0, 1, 0, 1, 1, 1, 0, 0};

int step2[10] = {1, 2, 1, 1, 1, 1, 1, 2, 1, 2};
int flg2[10]  = {0, 0, 0, 1, 0, 1, 1, 1, 0, 0};

Example Segmented Scan

int step2[10]  = {1, 2, 1, 1, 1, 1, 1, 2, 1, 2};
int flg2[10]   = {0, 0, 0, 1, 0, 1, 1, 1, 0, 0};

int result[10] = {1, 2, 3, 1, 2, 1, 1, 2, 3, 4};