CS516 Programming Languages and Compilers II


CS516 Programming Languages and Compilers II. GPU Programming II. Zheng Zhang, Rutgers University, Spring 2015 (Jan 29).

Review: Programming with CUDA
Compute Unified Device Architecture (CUDA): mapping and managing computations on the GPU.
* Makes the GPU work as a data-parallel computing device.
* No explicit mapping from the application to a graphics API is needed.
* Largest user base and most mature performance-tuning experience.
* Similar to other general-purpose GPU programming frameworks.

Review: CUDA Programming Model Terminology
Device: the GPU (worker). Capable of executing a large number of threads in parallel.
Host: the CPU (commander). Sends workload to and communicates with the GPU.
Kernel: a function to be run on the device; the actual code that runs on the GPU.

Review: Programming with CUDA. The Example in C: add vector A and vector B into vector C.

void add_vector(float *A, float *B, float *C, int N) {
    ......
    for (index = 0; index < N; index++)
        C[index] = A[index] + B[index];
    ......
}

void main() {
    ...
    add_vector(A1, B1, C1, N1);
    ...
}

Review: Programming with CUDA. The Example in CUDA: add vector A and vector B into vector C.

// kernel function to run on the GPU
__global__ void add_vector(float *A, float *B, float *C, int N) {
    ......
    C[tid] = A[tid] + B[tid];
    ......
}

// main function to run on the CPU
void main() {
    ...
    add_vector<<<dimGrid, dimBlock>>>(A1, B1, C1, N1);
    ...
}
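The slide leaves out how tid is obtained and how the launch is configured. A minimal self-contained sketch, assuming tid is the usual global thread index and a block size of 256 (both assumptions, not from the slide; host-to-device copies are omitted here and shown later in the host-side transpose example):

__global__ void add_vector(float *A, float *B, float *C, int N)
{
    // tid: global thread index, one thread per vector element
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)                        // guard threads beyond the end of the vectors
        C[tid] = A[tid] + B[tid];
}

int main()
{
    const int N1 = 1024;
    const int bytes = N1 * sizeof(float);
    float *A1, *B1, *C1;
    cudaMalloc((void**)&A1, bytes);     // device vectors
    cudaMalloc((void**)&B1, bytes);
    cudaMalloc((void**)&C1, bytes);

    dim3 dimBlock(256);                                   // assumed block size
    dim3 dimGrid((N1 + dimBlock.x - 1) / dimBlock.x);     // enough blocks to cover N1 elements
    add_vector<<<dimGrid, dimBlock>>>(A1, B1, C1, N1);
    cudaDeviceSynchronize();

    cudaFree(A1); cudaFree(B1); cudaFree(C1);
    return 0;
}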

Review: CUDA Thread View
add_vector<<<dimGrid, dimBlock>>>(A1, B1, C1, N1);
One or more blocks are assigned to each multiprocessor. Threads in a block are organized into warps (32 threads per warp). A (half) warp runs in SIMD fashion.

Review: SIMD (Single Instruction Multiple Data)
C[index] = A[index] + B[index];
thread 0:  C[0] = A[0] + B[0];
thread 1:  C[1] = A[1] + B[1];
...
thread 15: C[15] = A[15] + B[15];
All of these execute at the same time.

Performance Hazard I: SIMD Divergence
Eliminate as much control divergence as possible.
Control-flow (thread) divergence example, for threads tid = 0..7 with A[] = {0, 0, 6, 0, 0, 2, 4, 1}:
if (A[tid]) {...}
for (i = 0; i < A[tid]; i++) {...}
Divergence can degrade throughput by up to a factor of the warp size (warp size = 32 on modern GPUs).
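A hedged illustration (not from the slide): two kernels doing the same amount of work, one branching per thread within a warp (divergent), the other branching uniformly per warp of 32 threads (no divergence). Kernel names are hypothetical.

__global__ void divergent(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // neighboring threads in the same warp take different paths,
    // so the warp serializes and executes both branches
    if (tid % 2 == 0)
        out[tid] = sinf((float)tid);
    else
        out[tid] = cosf((float)tid);
}

__global__ void warp_uniform(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // all 32 threads of a warp take the same path: no divergence
    if ((tid / 32) % 2 == 0)
        out[tid] = sinf((float)tid);
    else
        out[tid] = cosf((float)tid);
}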

CUDA Memory Model
On-chip: registers, shared memory (scratchpad), instruction cache, constant cache, texture cache. Access speeds are similar.
Off-chip (all types share the GDDR memory space):
Local memory: register spilling, activation records for function calls.
Constant memory: read through a per-SM constant cache, optimized for broadcast reads.
Texture memory: read through the texture cache, optimized for spatially related reads, with dedicated filtering hardware (hardware cache).

CUDA Memory View
r/w registers, per thread (compiler)
r/w local memory, per thread (compiler)
r/w shared memory, per block (programmer and/or compiler)
r/w global memory, per grid (programmer and/or compiler)
r   constant memory, per grid (programmer and/or compiler)
r   texture memory, per grid (programmer and/or compiler)

CUDA Shared Memory (Scratchpad Memory)
Typically 16 or 32 banks; successive 32-bit words are assigned to successive banks (16 bits/cycle per bank).
Performance Hazard II: Bank Conflicts. Conflicts arise when threads of a (half) warp access different words that fall into the same bank, e.g., when threads access strided instead of contiguous array elements.
Shared memory can also be used as a software cache, placing hot/frequently used data objects in it.
(Figure on the slide: access patterns without conflicts vs. with conflicts.)
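A small illustration (not from the slide) of the strided-access case, assuming 32 banks and a one-warp block (blockDim.x == 32); names are hypothetical:

__global__ void bank_demo(float *out)
{
    __shared__ float s[64];
    int tid = threadIdx.x;            // assume blockDim.x == 32 (one warp)
    s[tid] = (float)tid;
    s[tid + 32] = (float)tid;
    __syncthreads();

    // stride-1: thread i reads word i, each bank is referenced once -> no conflict
    float no_conflict = s[tid];

    // stride-2: threads i and i+16 read words mapping to the same bank
    // (word 2*i and word 2*(i+16) differ by 32) -> 2-way conflict, ~2x slower
    float conflict = s[2 * tid];

    out[tid] = no_conflict + conflict;
}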

CUDA Global Memory (GDDR Memory)
Data is fetched in chunks; loading/storing such a chunk is called a transaction. A transaction can be 32, 64, or 128 bytes.
A thread warp needs to fetch data for all of its threads before it can run. Making the data for one warp fit into as few memory transactions as possible is called memory coalescing.
Performance Hazard III: Non-coalesced Memory Access.
Example of a non-coalesced (gather) access, where each thread reads through an index array:
... = A[P[tid]];
(Figure on the slide: threads 0..7 read scattered elements of A[] through P[], touching several memory chunks.)
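A minimal illustration (not from the slide) contrasting a coalesced access pattern with a strided one; kernel names are hypothetical:

__global__ void coalesced(float *out, const float *in, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = in[tid];            // consecutive threads touch consecutive words:
                                       // one warp's loads fit in a few transactions
}

__global__ void strided(float *out, const float *in, int n, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid * stride < n)
        out[tid] = in[tid * stride];   // consecutive threads are `stride` words apart:
                                       // each load may need its own transaction
}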

Example: Matrix Transpose

Matrix Transpose Code
Simple implementation.

__global__ void transpose_naive(float *odata, float *idata, int width, int height)
{
    unsigned int xindex = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int yindex = blockDim.y * blockIdx.y + threadIdx.y;
    if (xindex < width && yindex < height) {
        unsigned int index_in  = xindex + width * yindex;
        unsigned int index_out = yindex + height * xindex;
        odata[index_out] = idata[index_in];
    }
}

What is the performance hazard here?

Matrix Transpose Code
Simple implementation.

__global__ void transpose_naive(float *odata, float *idata, int width, int height)
{
    unsigned int xindex = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int yindex = blockDim.y * blockIdx.y + threadIdx.y;
    if (xindex < width && yindex < height) {
        unsigned int index_in  = xindex + width * yindex;
        unsigned int index_out = yindex + height * xindex;
        odata[index_out] = idata[index_in];
    }
}

The memory write access is not coalesced.

Example: Matrix Transpose Simple implementation. A thread corresponds to a cell in the matrix

Matrix Transpose Code
Efficient implementation.

__global__ void transpose(float *odata, float *idata, int width, int height)
{
    __shared__ float block[BLOCK_DIM][BLOCK_DIM+1];

    // read the matrix tile into shared memory
    unsigned int xindex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yindex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    if ((xindex < width) && (yindex < height)) {
        unsigned int index_in = yindex * width + xindex;
        block[threadIdx.y][threadIdx.x] = idata[index_in];
    }
    __syncthreads();

    // write the transposed matrix tile to global memory
    xindex = blockIdx.y * BLOCK_DIM + threadIdx.x;
    yindex = blockIdx.x * BLOCK_DIM + threadIdx.y;
    if ((xindex < height) && (yindex < width)) {
        unsigned int index_out = yindex * height + xindex;
        odata[index_out] = block[threadIdx.x][threadIdx.y];
    }
}

Memory accesses are coalesced.

Example: Matrix Transpose
Efficient implementation: coalesced read from global memory, transpose through shared memory, coalesced write to global memory.

Application Programming Interface
Four kinds of extensions to the C language:
Function type qualifiers:
  __device__ : runs on the device, callable from the device.
  __global__ : runs on the device, callable from the host.
  __host__   : runs on the host, callable from the host (default).
Variable type qualifiers:
  __constant__ : in constant memory.
  __shared__   : in shared memory.
  __device__   : in device memory (global memory by default).
Execution configuration for __global__ functions:
  __global__ void Func(float *parameter);
  Func<<<Dg, Db, Ns>>>(param);
Built-in variables: gridDim, blockIdx, blockDim, threadIdx.
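A short sketch (all names hypothetical) that uses these extensions together: the qualifiers, an execution configuration with a dynamic shared-memory size Ns, and the built-in variables:

#include <cuda_runtime.h>
#define N 1024

__constant__ float scale;                        // lives in constant memory

__device__ float clamp01(float x)                // device function, callable from kernels
{
    return x < 0.f ? 0.f : (x > 1.f ? 1.f : x);
}

__global__ void Func(float *param)               // kernel, callable from the host
{
    extern __shared__ float buf[];               // dynamic shared memory, Ns bytes per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // built-in variables
    if (i < N)
        buf[threadIdx.x] = clamp01(param[i] * scale);
    __syncthreads();
    if (i < N)
        param[i] = buf[threadIdx.x];
}

int main()
{
    float *param;
    cudaMalloc(&param, N * sizeof(float));
    float h_scale = 0.5f;
    cudaMemcpyToSymbol(scale, &h_scale, sizeof(float));  // initialize constant memory
    dim3 Dg(N / 256), Db(256);
    size_t Ns = 256 * sizeof(float);                     // dynamic shared memory per block
    Func<<<Dg, Db, Ns>>>(param);
    cudaDeviceSynchronize();
    cudaFree(param);
    return 0;
}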

Built-in Vector Types
char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1, int2, uint2, int3, uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4, float1, float2, float3, float4
e.g. float4 pos = make_float4(a, b, c, d);
     v = pos.x + pos.y + pos.z + pos.w;
dim3 type, e.g. dim3 blockSize(XSIZE, YSIZE); unspecified dimensions default to 1.

The Matrix Transpose Code on the Host Side

#define BLOCK_DIM 16

int main()
{
    int size_x = 32;
    int size_y = 128;
    int mem_size = size_x * size_y * sizeof(float);
    float *h_idata, *h_odata, *d_idata, *d_odata;
    h_idata = (float*) malloc(mem_size);
    h_odata = (float*) malloc(mem_size);
    cudaMalloc((void**) &d_idata, mem_size);   // allocate on device
    cudaMalloc((void**) &d_odata, mem_size);   // allocate on device

    // copy data from host to device
    cudaMemcpy(d_idata, h_idata, mem_size, cudaMemcpyHostToDevice);

    // set up execution configuration
    dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);
    dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);

    // call kernel function
    transpose_naive<<<grid, threads>>>(d_odata, d_idata, size_x, size_y);
    cudaThreadSynchronize();
    printf("Transpose kernel is done.\n");

    // copy results from device to host
    cudaMemcpy(h_odata, d_odata, mem_size, cudaMemcpyDeviceToHost);
}

Compile and Run
Compiler: nvcc
  nvcc -c main.cu -o kernel.cu_o -I/usr/local/cuda/include -I...
  g++ -o foo kernel.cu_o -L/usr/local/cuda/lib -L...
Run:
  foo arg1 arg2 ...
Debug:
  CUDA-GDB
  printf(), the classical approach!
  nvcc -deviceemu ... generates code to run on the emulator.
Lots of code samples at: /ilab/users/zz124/cs516_2015/samples/

20 min break

Parallel Primitives on GPU Parallel Reduction Parallel Scan (Prefix Sum and Segment Scan) Parallel Sort

Parallel Reduction
What is a reduction operation? (Think about MapReduce.)
The sequential code for the sum of an array:

int sum = 0;
for (int i = 0; i < N; i++)
    sum = sum + A[i];

sum after each iteration: 0, 3, 4, 11, 11, 15, 16, 22, 25

Basic Idea: Parallel Reduction (Summation)
Pairs of partial sums are combined tree-style (figure on the slide: parallel iterations 1, 2, and 3).

Parallel Reduction (Summation)
Initial implementation (figure: strides s = 1, 2, 4, 8).

for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    if (tid % (2*s) == 0)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}
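For context, a sketch of the full kernel this loop would sit in. The surrounding load/store code follows the structure used in Harris's reduction write-up (referenced later in this lecture) and is an assumption here, not from the slide:

__global__ void reduce_naive(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];                 // each thread loads one element
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)              // interleaved addressing (divergent)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)                            // thread 0 writes this block's partial sum
        g_odata[blockIdx.x] = sdata[0];
}

// launch: reduce_naive<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out);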

Parallel Reduction (Summation)
Initial implementation (figure: strides s = 1, 2, 4, 8).
What is the performance hazard here?

Review: SIMD (Single Instruction Multiple Data)
C[index] = A[index] + B[index];
thread 0:  C[0] = A[0] + B[0];
thread 1:  C[1] = A[1] + B[1];
...
thread 15: C[15] = A[15] + B[15];
All of these execute at the same time.

Parallel Reduction (Summation)
Initial implementation (figure: strides s = 1, 2, 4, 8).
Control Divergence!!!

Parallel Reduction (Summation)
Removed control divergence:

for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    int index = 2 * s * tid;
    if (index < blockDim.x)
        sdata[index] += sdata[index + s];
    __syncthreads();
}

Parallel Reduction (Summation) Removed divergence What is the performance hazard now? Potential shared memory bank conflicts

Review: CUDA Shared Memory Bank Conflicts
Typically 16 or 32 banks; successive 32-bit words are assigned to successive banks (16 bits/cycle per bank).
Conflicts arise when a thread accesses strided instead of contiguous array elements.
Shared memory can also be used as a software cache, placing hot/frequently used data objects in it.
(Figure on the slide: access patterns without conflicts vs. with conflicts.)

Parallel Reduction (Summation)
Third version: sequential addressing.

for (unsigned int s = blockDim.x/2; s > 0; s >>= 1) {
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}

Any possible further improvement?

Parallel Reduction (Summation)
Fourth version: reduce the number of conditional checks and thread synchronizations for the last warp.

for (unsigned int s = blockDim.x/2; s > 32; s >>= 1) {
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}
if (tid < 32) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

(The third version's loop, shown on the slide for comparison:)
for (unsigned int s = blockDim.x/2; s > 0; s >>= 1) {
    if (tid < s) { sdata[tid] += sdata[tid + s]; }
    __syncthreads();
}

Fifth version: halve the number of threads by doing the first addition during the load from global memory.

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
__syncthreads();

Parallel Reduction (Summation)
Sixth version: code specialization with a template parameter. blockSize is known at compile time, so the comparisons shown in red on the slide are evaluated at compile time.

if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
if (blockSize >= 128) { if (tid < 64)  { sdata[tid] += sdata[tid + 64];  } __syncthreads(); }
if (tid < 32) {
    if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
    if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
    if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
    if (blockSize >= 8)  sdata[tid] += sdata[tid + 4];
    if (blockSize >= 4)  sdata[tid] += sdata[tid + 2];
    if (blockSize >= 2)  sdata[tid] += sdata[tid + 1];
}
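The slide shows only the specialized body. A hedged sketch of how the template parameter and the launch-time dispatch fit around it (the volatile pointer in the unrolled warp and the dispatch function are additions not on the slide; the input is assumed to have 2 * blockSize * gridDim.x elements):

template <unsigned int blockSize>
__global__ void reduce_templated(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockSize * 2) + tid;
    sdata[tid] = g_idata[i] + g_idata[i + blockSize];   // first add during load (version 5)
    __syncthreads();

    // blockSize is a compile-time constant: dead branches are removed by the compiler
    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64)  sdata[tid] += sdata[tid + 64];  __syncthreads(); }

    if (tid < 32) {
        volatile int *v = sdata;   // volatile so the unrolled warp steps read/write memory
        if (blockSize >= 64) v[tid] += v[tid + 32];
        if (blockSize >= 32) v[tid] += v[tid + 16];
        if (blockSize >= 16) v[tid] += v[tid + 8];
        if (blockSize >= 8)  v[tid] += v[tid + 4];
        if (blockSize >= 4)  v[tid] += v[tid + 2];
        if (blockSize >= 2)  v[tid] += v[tid + 1];
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// host side: the block size is only known at run time, so dispatch to a compiled specialization
void launch_reduce(int *d_in, int *d_out, int blocks, int threads)
{
    size_t smem = threads * sizeof(int);
    switch (threads) {
    case 512: reduce_templated<512><<<blocks, 512, smem>>>(d_in, d_out); break;
    case 256: reduce_templated<256><<<blocks, 256, smem>>>(d_in, d_out); break;
    case 128: reduce_templated<128><<<blocks, 128, smem>>>(d_in, d_out); break;
    case 64:  reduce_templated<64><<<blocks, 64, smem>>>(d_in, d_out);   break;
    // ... and so on down to 1
    }
}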

Parallel Reduction (Summation)
There is a seventh version that optimizes it further!
Read: "Optimizing Parallel Reduction in CUDA" by Mark Harris, NVIDIA.

A Question for You: Parallel Segment Sum

tid (threads):  0 1 2 3 4 5 6 7
ind[]:          0 1 1 1 3 3 3 3
A[]:            1 1 1 1 1 1 1 1
SegSum[]:       1 3 4

Parallel Prefix Sum
Sequential prefix sum:

int PrefixSum[N];
PrefixSum[0] = 0;
for (i = 1; i < N; i++)
    PrefixSum[i] = PrefixSum[i-1] + A[i-1];

PrefixSum: [0 3 4 11 11 15 16 22]

Parallel Prefix Sum
Initial parallel implementation [Hillis and Steele 1986].
Complexity: O(n log(n)), where n is the total number of elements in the array.
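A single-block sketch of this O(n log n) scan, double-buffered in shared memory, in the spirit of the Hillis and Steele formulation. It assumes n = blockDim.x, n a power of two, and 2*n floats of dynamic shared memory; names are illustrative:

__global__ void scan_naive(float *g_odata, const float *g_idata, int n)
{
    extern __shared__ float temp[];          // 2*n floats: double buffer
    int tid = threadIdx.x;
    int pout = 0, pin = 1;

    // exclusive scan: shift the input right by one, seed position 0 with the identity
    temp[pout * n + tid] = (tid > 0) ? g_idata[tid - 1] : 0.0f;
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2) {
        pout = 1 - pout;                     // swap input/output buffers
        pin  = 1 - pout;
        if (tid >= offset)
            temp[pout * n + tid] = temp[pin * n + tid] + temp[pin * n + tid - offset];
        else
            temp[pout * n + tid] = temp[pin * n + tid];
        __syncthreads();
    }
    g_odata[tid] = temp[pout * n + tid];     // result: exclusive prefix sums
}

// launch (one block of n threads): scan_naive<<<1, n, 2 * n * sizeof(float)>>>(d_out, d_in, n);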

Parallel Prefix Sum
Work-efficient implementation [Blelloch 1990]. (A) The up-sweep (reduce) phase.
Complexity: O(n).

Parallel Prefix Sum
Work-efficient implementation. (B) The down-sweep phase.
Complexity: O(n).
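Putting the two phases together, a single-block sketch in the Blelloch style (an assumption here: n a power of two, one thread per pair of elements, n floats of dynamic shared memory; the bank-conflict padding discussed next is omitted):

__global__ void blelloch_scan(float *g_odata, const float *g_idata, int n)
{
    extern __shared__ float temp[];          // n floats
    int tid = threadIdx.x;
    int offset = 1;
    temp[2*tid]   = g_idata[2*tid];          // each thread loads two elements
    temp[2*tid+1] = g_idata[2*tid+1];

    // up-sweep (reduce) phase: build partial sums up the tree
    for (int d = n >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2*tid + 1) - 1;
            int bi = offset * (2*tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }

    // clear the last element, then down-sweep to produce the exclusive scan
    if (tid == 0) temp[n-1] = 0.0f;
    for (int d = 1; d < n; d *= 2) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2*tid + 1) - 1;
            int bi = offset * (2*tid + 2) - 1;
            float t   = temp[ai];
            temp[ai]  = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();
    g_odata[2*tid]   = temp[2*tid];
    g_odata[2*tid+1] = temp[2*tid+1];
}

// launch (n/2 threads in one block): blelloch_scan<<<1, n/2, n * sizeof(float)>>>(d_out, d_in, n);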

Parallel Prefix Sum
Shared memory bank conflicts can be avoided by padding the shared-memory array.

Parallel Sort
Bitonic sort. A bitonic sequence is defined as a list with no more than one LOCAL MAXIMUM and no more than one LOCAL MINIMUM.
Binary split: divide the bitonic list into two equal halves; compare-exchange each item in the first half with the corresponding item in the second half.
Result: two bitonic sequences where the numbers in one sequence are all less than the numbers in the other sequence.

Parallel Sort
Bitonic sort: many steps of bitonic splits.
Complexity: O(n log²(n)). Parallel steps: O(log²(n)).

Parallel Sort
Bitonic sort code in CUDA:

__global__ static void bitonicSort(int *values)
{
    extern __shared__ int shared[];
    const unsigned int tid = threadIdx.x;

    // Copy input to shared memory.
    shared[tid] = values[tid];
    __syncthreads();

    // Parallel bitonic sort.
    for (unsigned int k = 2; k <= NUM; k *= 2) {
        // Bitonic merge:
        for (unsigned int j = k / 2; j > 0; j /= 2) {
            unsigned int ixj = tid ^ j;
            if (ixj > tid) {
                if ((tid & k) == 0) {
                    if (shared[tid] > shared[ixj])
                        swap(shared[tid], shared[ixj]);
                } else {
                    if (shared[tid] < shared[ixj])
                        swap(shared[tid], shared[ixj]);
                }
            }
            __syncthreads();
        }
    }

    // Write result.
    values[tid] = shared[tid];
}

Parallel Sort
Merge sort (can also utilize the bitonic split). Every node may correspond to one processor/core.

Parallel Sort
Radix sort.
1. Sort on every digit, from the least significant to the most significant bit (or the other way around). The per-digit sort must be stable: data elements already ordered by previously sorted digits keep their relative order during the shuffling phase.
2. Parallel prefix sum fits nicely into this framework (see the sketch below): every block of threads counts the frequency of the different digit values and computes each element's position within its digit-value group.
See "Designing efficient sorting algorithms for manycore GPUs" (IPDPS '09).
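To make step 2 concrete, a hedged sequential sketch of one radix pass on a single bit: an exclusive prefix sum over the "bit is 0" flags directly gives the stable scatter positions. Function and variable names are illustrative, not from the paper; on the GPU the prefix sum would be the parallel scan shown earlier.

#include <stdlib.h>

void radix_pass(const unsigned *in, unsigned *out, int n, int bit)
{
    int *scan = (int *)malloc((n + 1) * sizeof(int));
    scan[0] = 0;
    // exclusive prefix sum of the "bit is 0" flags
    for (int i = 0; i < n; i++)
        scan[i + 1] = scan[i] + (((in[i] >> bit) & 1) == 0);
    int zeros = scan[n];                          // total number of keys with a 0 in this bit

    for (int i = 0; i < n; i++) {
        if (((in[i] >> bit) & 1) == 0)
            out[scan[i]] = in[i];                 // 0-bit keys keep their relative order
        else
            out[zeros + (i - scan[i])] = in[i];   // 1-bit keys follow, also in order (stable)
    }
    free(scan);
}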

References
(1) "Designing efficient sorting algorithms for manycore GPUs" (IPDPS '09). Radix sort and merge sort; uses prefix sum extensively.
(2) "Fast parallel GPU-sorting using a hybrid algorithm" (JPDC '08). Combination of bucket sort and vector merge sort.
(3) "GPU-ABiSort: optimal parallel sorting on stream architectures" (IPDPS '06). Complexity: O(n log(n)/p) with up to p = n/log(n) streaming processors.
(4) Blelloch, 1990. "Prefix Sums and Their Applications." Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University.
(5) Hillis and Steele, 1986. "Data Parallel Algorithms." Communications of the ACM 29(12), pp. 1170-1183.

Next Class Resource Allocation (Compiler Middle-end and Back-end): * SSA * Register and Memory Coloring