Overview: Graphics Processing Units


Overview: Graphics Processing Units
- the advent of GPUs
- GPU architecture: the NVIDIA Fermi processor
- the CUDA programming model: simple example, thread organization, memory model
- case study: matrix multiply: memories, thread synchronization, scheduling
- case study: reductions
- performance considerations: bandwidth, scheduling, resource conflicts, instruction mix
- host-device data transfer: multiple GPUs, NVLink, Unified Memory, APUs
- the OpenCL programming model
- directive-based programming models
- refs: CUDA Toolkit Documentation; An Even Easier Introduction to CUDA (tutorial); NCI NF GPU page; Programming Massively Parallel Processors, Kirk & Hwu, Morgan-Kaufmann, 2010; CUDA by Example, Sanders & Kandrot; OpenCL web page; OpenCL in Action, Matthew Scarpino

Advent of General-Purpose Graphics Processing Units
- many applications have massive amounts of mostly independent calculations, e.g. ray tracing, image rendering, matrix computations, molecular simulations, HDTV
- these can be largely expressed in terms of SIMD operations, implementable with minimal control logic & caches and simple instruction sets
- design point: maximize the number of ALUs & FPUs and the memory bandwidth, to take advantage of Moore's Law (shown here)
- put this on a co-processor (GPU); have a normal CPU to co-ordinate, run the operating system, launch applications, etc.
- such an architecture/infrastructure requires a massive economic base for its development (the gaming industry!)
- pre 2006: only specialized graphics operations (integer & float data)
- 2006: General Purpose (GPGPU): general computations, but only through a graphics library (e.g. OpenGL)
- 2009: programmable for general (numeric) calculations (e.g. CUDA, OpenCL)
- some applications achieve large speedups (10-500x) over a single CPU core

Graphics Processor Unit Systems
- GPU systems are a co-processor device on a CPU-based system ([O'H. & Bryant, fig 1.4])
- separate memory spaces (DRAMs) for the CPU (host) and the GPU (device)
- must allocate space on the GPU and copy data from CPU memory to GPU memory (and vice versa) via the PCIe bus
- also need a way to copy the GPU executable code across and start it (kernel launch); a minimal sketch of this workflow follows below
- issues? Why not use the same memory space?
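A minimal sketch of the above workflow (not from the original slides; someKernel, x, n, numBlocks and threadsPerBlock are hypothetical placeholders):

float *x_d;                                                     // device pointer
cudaMalloc(&x_d, n * sizeof(float));                            // allocate space in device DRAM
cudaMemcpy(x_d, x, n * sizeof(float), cudaMemcpyHostToDevice);  // host -> device copy over PCIe
someKernel<<<numBlocks, threadsPerBlock>>>(x_d, n);             // kernel launch on the device
cudaMemcpy(x, x_d, n * sizeof(float), cudaMemcpyDeviceToHost);  // copy results back to the host
cudaFree(x_d);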

Graphics Processor Unit Architecture
- GPU chip: an array of streaming multiprocessors (SMs) sharing an L2 cache; comparison with UltraSPARC T2 (courtesy Real World Tech)
- each SM has (8-32) streaming processors (SPs); only SPs (= cores) within an SM can (easily) synchronize and share data
- identical threads are organized into fixed-size blocks, each allocated to an SM; blocks in turn are divided into warps
- at any timestep, all SPs execute an instruction from a warp ("SIMT" mode); latencies are hidden by scheduling from many warps
- Tesla S2050 co-processor; Tesla S2050 architecture (courtesy NVIDIA)

The Fermi Graphics Processor Chip
- GF110 model: 1.15 GHz; 900 W; 3D grid & thread blocks; warp size: 32; max resident: blocks 8, warps 32, threads 1536 (from NCI NF page)

GPU vs CPU Floating Point Speed and Memory Bandwidth
(figure only; not reproduced in this transcription)

The Common Unified Device Architecture Programming Model
- "device" refers to a co-processor with its own DRAM that can run many threads in parallel
- "host" performs serial execution, transfers data to/from the device (via DMA), and sends (highly parallel) kernels to the device
- a kernel's threads are organized into a grid of blocks; each block is sent to an SM
- a CUDA program is a C/C++ program with device calls & kernels (each with many threads) embedded into it
- GPU threads are very lightweight (some overheads in invoking a kernel and dispatching each block)
- threads are identical but have thread (& block) ids (courtesy NCSU); see the sketch below
- the CUDA compiler (e.g. nvcc) produces a normal executable with the device code embedded into it; it has the CUDA runtime (cudart) and core (cuda) libraries linked into it
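A minimal kernel sketch illustrating these ids (not from the slides; scale, x_d and n are placeholder names):

__global__ void scale(float *x, float alpha, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread id from block & thread ids
  if (i < n) x[i] *= alpha;                        // each thread handles one array element
}
// host side: launch a grid of ceil(n/256) blocks of 256 threads each
// scale<<<(n + 255) / 256, 256>>>(x_d, 2.0f, n);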

CUDA Program: Simple Example
reverse an array (reverseArray.cu):

__global__ void reverseArray(int *a_d, int N) {
  int idx = threadIdx.x;
  int v = a_d[N - idx - 1];
  a_d[N - idx - 1] = a_d[idx];
  a_d[idx] = v;
}

#define N (1<<16)
int main() {
  int a[N], *a_d, a_size = N * sizeof(int);   // may not dereference a_d!
  ...
  cudaMalloc((void **)&a_d, a_size);
  cudaMemcpy(a_d, a, a_size, cudaMemcpyHostToDevice);
  reverseArray<<<1, N/2>>>(a_d, N);
  cudaThreadSynchronize();                    // wait till threads finish
  cudaMemcpy(a, a_d, a_size, cudaMemcpyDeviceToHost);
  cudaFree(a_d);
  ...
}

cf. OpenMP on a normal multicore: style; practicality?

#pragma omp parallel num_threads(N/2) default(shared)
{
  int idx = omp_get_thread_num();
  int v = a[N - idx - 1];
  a[N - idx - 1] = a[idx];
  a[idx] = v;
}

CUDA Thread Organization and Memory Model
- a 2x1 grid with 2x1 blocks; the memory model (left) reflects that of the GPU
- a 2x2 grid with 4x2x2 blocks (a launch sketch follows below)
(courtesy Real World Tech.) (courtesy NCSC)
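A sketch of how such a configuration would be launched (not from the slides; someKernel is a placeholder):

dim3 dimGrid(2, 2);       // 2x2 grid of blocks: gridDim.x == 2, gridDim.y == 2
dim3 dimBlock(4, 2, 2);   // each block has 4x2x2 threads: blockDim.x == 4, .y == 2, .z == 2
someKernel<<<dimGrid, dimBlock>>>(/* ... arguments ... */);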

Case Study: Matrix Multiply
- perform C += A*B, where C is N x N, A is N x K, and B is K x N
- column-major storage: C_ij is at C[i + j*N]
- 1st attempt: each thread computes one element of C, C_ij
- invocation with W x W thread blocks (assume W divides N)
- why is this better than using an N x N thread block? (2 reasons, both important!)
- for thread (t_x, t_y) of block (b_x, b_y): i = b_y*W + t_y and j = b_x*W + t_x
(courtesy xfig)

CUDA Matrix Multiply: Implementation
kernel:

__global__ void matMult(int N, int K, double *A_d, double *B_d, double *C_d) {
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  double cij = C_d[i + j*N];
  for (int k = 0; k < K; k++)
    cij += A_d[i + k*N] * B_d[k + j*K];
  C_d[i + j*N] = cij;
}

main program: needs to allocate device versions of A, B & C (A_d, B_d and C_d) and cudaMemcpy() the host versions into them
invocation with W x W thread blocks (assume W divides N):

dim3 dimG(N/W, N/W);
dim3 dimB(W, W);    // in kernel, blockDim.x == W
matMult<<<dimG, dimB>>>(N, K, A_d, B_d, C_d);

what if N % W > 0? Add if (i < N && j < N) to the kernel and declare dim3 dimG((N+W-1)/W, (N+W-1)/W); (see the sketch below)
note: due to the SIMD nature of the SPs, cycles for both branches of the if are consumed
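A sketch of that boundary-checked variant (assumed, not from the original slides; matMultGuarded is a placeholder name):

__global__ void matMultGuarded(int N, int K, double *A_d, double *B_d, double *C_d) {
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N && j < N) {                       // threads that fall outside C simply do nothing
    double cij = C_d[i + j*N];
    for (int k = 0; k < K; k++)
      cij += A_d[i + k*N] * B_d[k + j*K];
    C_d[i + j*N] = cij;
  }
}
// launched with dim3 dimG((N + W - 1)/W, (N + W - 1)/W), dim3 dimB(W, W)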

CUDA Memories and Thread Synchronization
- GPUs can potentially suffer even more from the memory wall: DRAM access may still be 100s of cycles, and bandwidth is limited for load/store intensive kernels
- the shared memory is on-chip (hence very fast); the __shared__ type modifier may be used to denote a (fixed-size) array allocated to shared memory
- threads within a block can synchronize via the (efficient - why?) __syncthreads() intrinsic
- (SM-level) atomic instructions can enforce data consistency within a block (see the sketch below)
- note: there is no way to synchronize between blocks, or to safely ensure data consistency across blocks; this can only be done across separate kernel invocations
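A small sketch of a block-level atomic (assumed, not from the slides; blockSum, A_d and sums_d are placeholder names; float atomicAdd() is available from compute capability 2.x, and A_d is assumed to have blockDim.x * gridDim.x elements):

__global__ void blockSum(const float *A_d, float *sums_d) {
  __shared__ float blockTotal;                 // one accumulator per block, in shared memory
  if (threadIdx.x == 0) blockTotal = 0.0f;
  __syncthreads();                             // all threads see the initialized value
  atomicAdd(&blockTotal, A_d[blockIdx.x * blockDim.x + threadIdx.x]);
  __syncthreads();                             // all atomics complete before the write-out
  if (threadIdx.x == 0) sums_d[blockIdx.x] = blockTotal;
}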

Matrix Multiply Using Shared Memory
- threads (t_x, 0) ... (t_x, W-1) all access B_{k, b_x*W + t_x}; threads (0, t_y) ... (W-1, t_y) all access A_{b_y*W + t_y, k}
- the high ratio of load to FP instructions makes it harder to hide L1 cache latencies and strains memory bandwidth
- can improve the kernel by utilizing the SM's shared memory (a launch sketch follows the kernel below):

__global__ void matMult_s(int N, int K, double *A_d, double *B_d, double *C_d) {
  __shared__ double A_s[W][W], B_s[W][W];
  int ty = threadIdx.y, tx = threadIdx.x;
  int i = blockIdx.y*W + ty, j = blockIdx.x*W + tx;
  double cij = C_d[i + j*N];
  for (int k = 0; k < K; k += W) {
    A_s[ty][tx] = A_d[i + (k + tx)*N];
    B_s[ty][tx] = B_d[(k + ty) + j*K];
    __syncthreads();
    for (int w = 0; w < W; w++)
      cij += A_s[ty][w] * B_s[w][tx];
    __syncthreads();   // can this be avoided?
  }
  C_d[i + j*N] = cij;
}
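A usage sketch (assumed, not from the slides): W must be a compile-time constant visible to the kernel (e.g. #define W 16), and is assumed here to divide both N and K:

dim3 dimG(N / W, N / W), dimB(W, W);
matMult_s<<<dimG, dimB>>>(N, K, A_d, B_d, C_d);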

GPU Scheduling - Warps
- the GigaThread scheduler assigns the (independently executable) thread blocks to each SM
- each block is divided into groups (of 32) called warps; grouping occurs in linear order of the thread id t_x + blockDim.x*t_y + blockDim.x*blockDim.y*t_z (the figure uses warp size 4); see the snippet below
- the warp scheduler determines which warps are ready to run
- with 32-thread warps, suitable block sizes range from 4x8 to 16x16
- SIMT: each SP executes the next instruction SIMD-style (note: this requires only a single instruction fetch!)
- thus, a kernel with enough blocks can scale across a GPU with any number of cores
(courtesy NVIDIA - both)
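A small sketch of that linearization inside a kernel (not from the slides):

int tid  = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);  // linear id within the block
int warp = tid / warpSize;   // which warp of the block this thread belongs to (warpSize == 32)
int lane = tid % warpSize;   // position within that warp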

Reductions and Thread Divergence
threads within a single 1D block summing A[0..N-1]:

__global__ void sumV(int N, double *A, double *s) {
  extern __shared__ double psum[];   // dynamically sized: blockDim.x doubles (see the launch sketch below)
  int bx = blockDim.x;
  int tx = threadIdx.x, x;
  psum[tx] = ...
  for (x = bx/2; x > 0; x /= 2) {
    __syncthreads();
    if (tx < x) psum[tx] += psum[tx + x];
  }
  if (tx == 0) *s = psum[tx];
}

- predicated execution: threads in a warp where the condition is false execute a no-op
- if-else statements thus cause thread divergence (worse when nested) (courtesy NVIDIA)
- here divergence is minimized: it occurs only when x < 32 (i.e. within one warp)
- cf. the alternative algorithm:

for (x = 1; x < bx; x *= 2) {
  __syncthreads();
  if (tx % (2*x) == 0) psum[tx] += psum[tx + x];
}
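A usage sketch for the dynamically sized psum array (assumed, not from the slides; bx is the chosen block size, assumed a power of two, and A_d, s_d are device pointers):

sumV<<<1, bx, bx * sizeof(double)>>>(N, A_d, s_d);   // third launch parameter = bytes of dynamic shared memory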

Global Memory Bandwidth Issues
- in the reduction example, all threads in a warp contiguously access the (shared) array psum
- this is very important when you have global memory accesses: the memory subsystem can coalesce these into a single access (see the sketch below)
- this allows the DRAM banks to deliver peak bandwidth (burst mode); reason: the 2D organization of DRAM chips (same row address) (Lect 3, p14)
- matMult example: threads within a warp access A contiguously, but not B; the effect of the accesses to B in this case is mitigated by the use of shared memory in the multiply
- note that this effect is the opposite of normal cores, where contiguous access within a thread is most desirable (maximizes spatial locality)
- worst case scenario: memory strides in (large) powers of 2, which cause memory bank conflicts
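An illustrative sketch of the two access patterns (not from the slides; copyCoalesced and copyStrided are placeholder names):

__global__ void copyCoalesced(const float *in, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];                       // consecutive threads in a warp touch consecutive words: coalesced
}

__global__ void copyStrided(const float *in, float *out, int n, int stride) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i * stride < n) out[i] = in[i * stride];     // stride > 1 scatters the warp's accesses: poor coalescing
}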

SM Registers and Warp Scheduling
- the SM maintains the block ids of the scheduled blocks, and the thread ids (and block sizes) of the scheduled threads
- the SM's (32K-word) register file is shared between all of these; the block and thread ids are used to index the file for the registers allocated to a particular thread
- warps whose next instruction has its operands ready for consumption may be selected; round-robin is used if there are several ready; thus, registers need to be scoreboarded
- can make use of this to (software) prefetch data and better hide latencies (shared memory matmult)
- example: if there are 4 instructions between a load & its use on the G80, with 4 clock cycles needed to process an instruction, each warp covers only 4 x 4 = 16 cycles of independent work, so we need 14 active warps (the stalled one plus roughly 200/16 others) to tolerate a 200-cycle memory latency
(courtesy NVIDIA)

Performance Considerations: Shared SM Resources
- on Fermi GPUs, an SM may have resident: 8 blocks, 32 warps and 1536 threads; a 128 KB register file; 64 KB shared memory / L1 cache
- to fully utilize the block & thread slots, need at least 192 threads per block
- assuming 4-byte operands, can have at most 16 registers per thread
- optimizations on a kernel resulting in more registers may result in fewer blocks being resident... (courtesy NVIDIA)
- resource contention can cause a dramatic loss of performance; the CUDA occupancy calculator can help evaluate this

Performance Considerations: Instruction Mix
- goal: keep the SPs' FPUs fully occupied doing useful operations; every other kind of instruction (loads, address calculations, branches) hinders this!
- matrix multiply revisited:
- strategy 1: unroll the k loop, which halves loop index increments & branches:

for (int k = 0; k < K; k += 2)
  cij += A_d[i + k*N] * B_d[k + j*K] + A_d[i + (k+1)*N] * B_d[k+1 + j*K];

- strategy 2: each thread computes a 2x2 tile of C instead of a single element (see the sketch below); this reduces load instructions and reduces branches by 4x, but may require 4x the registers! It also increases thread granularity: may help if K is not large
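A sketch of strategy 2 (assumed, not from the original slides; matMult2x2 is a placeholder name, and 2W is assumed to divide N, with W x W thread blocks each covering a 2W x 2W tile of C):

__global__ void matMult2x2(int N, int K, double *A_d, double *B_d, double *C_d) {
  int i = 2 * (blockIdx.y * blockDim.y + threadIdx.y);   // top row of this thread's 2x2 tile
  int j = 2 * (blockIdx.x * blockDim.x + threadIdx.x);   // left column of the tile
  double c00 = C_d[i   + j*N],     c01 = C_d[i   + (j+1)*N];
  double c10 = C_d[i+1 + j*N],     c11 = C_d[i+1 + (j+1)*N];
  for (int k = 0; k < K; k++) {
    double a0 = A_d[i + k*N], a1 = A_d[i+1 + k*N];       // two loads of A ...
    double b0 = B_d[k + j*K], b1 = B_d[k + (j+1)*K];     // ... and two of B feed four multiply-adds
    c00 += a0*b0; c01 += a0*b1;
    c10 += a1*b0; c11 += a1*b1;
  }
  C_d[i   + j*N] = c00;  C_d[i   + (j+1)*N] = c01;
  C_d[i+1 + j*N] = c10;  C_d[i+1 + (j+1)*N] = c11;
}
// launch: matMult2x2<<<dim3(N/(2*W), N/(2*W)), dim3(W, W)>>>(N, K, A_d, B_d, C_d);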

Host-Device Issues: Multiple GPUs, NVLink, and Unified Memory
- transfer of data to/from host to device is error-prone and potentially a performance bottleneck (what if the array for an advection solver could not fit in GPU memory?)
- the problem is exacerbated when multiple GPUs are connected to one host; we can select the required device by cudaSetDevice():

cudaSetDevice(0);
cudaMalloc(&a_d, n); cudaMemcpy(a_d, a, n, ...);
reverseArray<<<1, n/2>>>(a_d, n);
cudaThreadSynchronize();
cudaMemcpyPeer(b_d, 1, a_d, 0, n);   // copy a_d on device 0 to b_d on device 1
cudaSetDevice(1);
reverseArray<<<1, n/2>>>(b_d, n);

- fast interconnects such as NVLink will reduce the transfer costs (e.g. the Sierra system)
- CUDA's Unified Memory will improve programmability issues (and in some cases, performance): cudaMallocManaged(&a, n); allocates the array on the host so that it can migrate, page-by-page, to/from the GPU(s) transparently and on demand (see the sketch below)
- alternatively, have the device and the CPU use the same memory, as on AMD's APU for Exascale Computing
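A sketch of the reverseArray example using Unified Memory (assumed, not from the slides):

int *a;
cudaMallocManaged(&a, n * sizeof(int));   // managed allocation, visible to both host and device
for (int i = 0; i < n; i++) a[i] = i;     // host writes it directly; pages migrate on demand
reverseArray<<<1, n/2>>>(a, n);
cudaDeviceSynchronize();                  // ensure the kernel has finished before the host reads a[]
cudaFree(a);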

The Open Computing Language for Devices and Regular Cores
- open standard, not proprietary like CUDA; based on C (no C++)
- design philosophy: treat GPUs and CPUs as peers; data- and task-parallel compute model
- similar execution model to CUDA:
  NDRange (CUDA grid): operates on global data; units within it cannot synchronize
  WorkGroup (CUDA block): units within it can synchronize and share local data (CUDA shared memory)
  WorkItem (CUDA thread): independent unit of execution, also has private data
- example kernel:

__kernel void reverseArray(__global int *a_d, int N) {
  int idx = get_global_id(0);
  int v = a_d[N - idx - 1];
  a_d[N - idx - 1] = a_d[idx];
  a_d[idx] = v;
}

- recall that in CUDA we could launch this as reverseArray<<<1, N/2>>>(a_d, N), but in OpenCL...

OpenCL Kernel Launch
must explicitly create a device handle, compute context and work-queue, load and compile the kernel, and finally enqueue it for execution:

clGetDeviceIDs(..., CL_DEVICE_TYPE_GPU, 1, &device, ...);
context = clCreateContext(0, 1, &device, ...);
queue = clCreateCommandQueue(context, device, ...);
program = clCreateProgramWithSource(context, 1, &source, ...);   // source: the contents of "reverseArray.cl"
clBuildProgram(program, 1, &device, ...);
reverseArray_k = clCreateKernel(program, "reverseArray", ...);
clSetKernelArg(reverseArray_k, 0, sizeof(cl_mem), &a_d);
clSetKernelArg(reverseArray_k, 1, sizeof(int), &N);
cnDimension = N/2; cnBlockSize = N/2;   // global and per-workgroup work sizes (cf. <<<1, N/2>>>)
clEnqueueNDRangeKernel(queue, reverseArray_k, 1, 0, &cnDimension, &cnBlockSize, 0, 0, 0);

- note: CUDA host code is compiled into .cubin intermediate files which follow a similar sequence for usage
- on a normal core (CL_DEVICE_TYPE_CPU), a WorkItem corresponds to an item in a work queue that a number of (kernel-level) threads get work from; the compiler may aggregate these to reduce overheads

Directive-Based Programming Models
- OpenACC enables us to specify which code is to run on a device, and how to transfer data to/from it:

#pragma acc parallel loop copyin(A, B) copy(C)
for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++) {
    double cij = C[i + j*N];
    for (int k = 0; k < K; k++)
      cij += A[i + k*N] * B[k + j*K];
    C[i + j*N] = cij;
  }

- the data directive may be used to specify data placement across kernels
- the code can also be compiled to run across multiple CPUs
- OpenMP 4.0 operates similarly; for the above example:

#pragma omp target map(to: A[0:N*K], B[0:N*K]) map(tofrom: C[0:N*N])
#pragma omp parallel for default(shared)

- studies on complex applications, where all data must be kept on the device, indicate a productivity gain and a performance loss of about 2x relative to CUDA (e.g. Zhe14)

Graphics Processing Units: Summary
- designed to exploit computations expressible as large numbers of identical, independent threads, grouped into blocks: each block is allocated to an SM and hence can have synchronization within it
- GPU cores are designed for throughput, not single-thread speed: low clock speed, instructions taking several clock cycles
- SIMT execution to hide long latencies; large amounts of hardware to maintain many thread contexts
- destructive sharing appears as resource contention: performance may be lost through poor utilization, but not from load imbalance
- L2 cache and memory bandwidth are important considerations, but the main consideration in access patterns is within a warp