Lecture 11: GPU programming

David Bindel, 4 Oct 2011

Logistics

- Matrix multiply results are ready
  - Summary on assignments page
  - My version (and writeup) on CMS
- HW 2 due Thursday
- Still working on project 2!
- Start thinking about possible projects...

Matrix multiply outcome: [plot of achieved MFlop/s (0 to about 4000) vs. matrix dimension n (0 to 800)]

HW 2 comments

- Due Thursday night: don't wait until the last minute!
- This is not meant to be a hard assignment...
  ... but leave time to get confused and ask questions.
- Three basic tasks:
  - OpenMP: parallelize by adding pragmas to the code
  - MPI: fill in the missing communication routine
  - Both: report on some performance experiments
- You can debug on your own computer
  - Need a recent gcc to get OpenMP support
  - Need an MPI implementation (I recommend OpenMPI)
- Make sure to test with 1, 2, and 4 processes
- Make sure timings are done on the cluster worker nodes!

HW 2: Ghost cells revisited [figure: global node indices 0-10 split across three processors; each of P0, P1, P2 uses local indices 0-4, with neighboring subdomains overlapping in ghost cells]

Notes on timing

- Different notions of time:
  - clock(): processor time
  - omp_get_wtime() and MPI_Wtime(): wall-clock time
  - clock_gettime(): depends!
- I/O generally does not count toward processor time
- Generally we care about wall-clock time

Notes on timing

- Timer resolution is limited!
  - omp_get_wtick(): timer resolution in OpenMP
  - MPI_Wtick(): same in MPI
- Do enough steps to get reasonable timings
- When reporting time vs. size, it's reasonable to look at time/step
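
As a concrete illustration (my sketch, not from the slides), this is roughly what the timing advice looks like in code; the loop body is a toy stand-in for whatever computation you are actually measuring:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int nsteps = 10000000;           /* enough steps to swamp the timer resolution */
        double x = 1.0;
        double t0 = omp_get_wtime();     /* wall-clock time, not processor time */
        for (int i = 0; i < nsteps; ++i)
            x = 0.5 * (x + 2.0/x);       /* dummy work standing in for one "step" */
        double elapsed = omp_get_wtime() - t0;
        printf("tick = %e s, total = %e s, per step = %e s (x = %g)\n",
               omp_get_wtick(), elapsed, elapsed/nsteps, x);
        return 0;
    }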

... and now on to the main event...

Some history

- Late 80s to early 90s: golden age for supercomputing
  - Companies: Thinking Machines, MasPar, Cray
  - Relatively fast processors (vs. memory)
  - Lots of academic interest and development
  - But it got hard to compete with commodity hardware
    - Scientific computing is not a market driver!
- 90s to early 2000s: age of the cluster
  - Beowulf, grid computing, etc.
  - Big iron also uses commodity chips (better interconnect)
- Past few years
  - CPU producers move to multicore
  - High-end graphics becomes commodity hardware
    - Gaming is a market driver!
  - GPU producers realize their many-core designs can apply to general-purpose computing

Thread design points

- Threads on desktop CPUs
  - Implemented via lightweight processes (for example)
  - General system scheduler
  - Thrashing when there are more active threads than processors
- An alternative approach
  - Hardware support for many threads per CPU
    - Modest example: hyperthreading
    - More extreme: Cray MTA-2 and XMT
  - Hide memory latency by thread switching
  - Want many more independent threads than cores
- GPU programming
  - Thread creation / context switching are basically free
  - Want lots of threads (thousands for efficiency?!)

General-purpose GPU programming

- Old GPGPU model: use texture mapping interfaces
  - People got good performance!
  - But too clever by half
- CUDA (Compute Unified Device Architecture)
  - More natural general-purpose programming model
  - Initial release in 2007; now in version 3.0
- OpenCL
  - Relatively new (late 2009); in Apple's Snow Leopard
  - Open standard (Khronos group) including NVIDIA, ATI, etc.
- And so on: DirectCompute (MS), Brook+ (Stanford/AMD), RapidMind (Waterloo (Sh)/RapidMind/Intel?)
- Today: C for CUDA (more available examples)

Compiling CUDA

- nvcc is the driver
  - Builds on top of g++ or other compilers
- The nvcc driver produces CPU and PTX code
- PTX (Parallel Thread eXecution)
  - Virtual machine and ISA
  - Compiles down to binary for the target
- Can compile in device emulation mode for debugging: nvcc -deviceemu
  - Can use native debug support
  - Can access data across host/device boundaries
  - Can call printf from device code

CUDA programming

    do_something_on_cpu();
    some_kernel<<<nblk, ntid>>>(args);
    do_something_else_on_cpu();
    cudaThreadSynchronize();

- Highly parallel kernels run on the device
- Vaguely analogous to parallel sections in OpenMP code
- The rest of the code runs on the host (CPU)
- C + extensions to program both host code and kernels

Thread blocks

- Monolithic thread array partitioned into blocks
  - Blocks have a 1D or 2D numeric identifier
  - Threads within blocks have a 1D, 2D, or 3D identifier
  - Identifiers help figure out what data to work on
- Blocks cooperate via shared memory, atomic ops, barriers
- Threads in different blocks cannot cooperate
  - ... except for the implied global barrier from the host
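
For instance (a sketch of mine, not from the slides), a kernel typically combines blockIdx, blockDim, and threadIdx to decide which data element it owns; scale_entries below is a hypothetical kernel over an m-by-n array stored in row-major order:

    // Hypothetical kernel: each thread handles one entry of an m x n array.
    __global__ void scale_entries(float* a, int m, int n, float alpha)
    {
        int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index
        int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index
        if (i < m && j < n)
            a[i*n + j] *= alpha;   // row-major indexing; guard the ragged edge
    }

    // Launch with a 2D grid of 2D blocks, e.g. 16 x 16 threads per block:
    //   dim3 threads(16, 16);
    //   dim3 blocks((n+15)/16, (m+15)/16);
    //   scale_entries<<<blocks, threads>>>(d_a, m, n, 2.0f);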

Memory access

- Registers are registers; per thread
- Shared memory is small, fast, on-chip; per block
- Global memory is a large, uncached, off-chip space
  - Also accessible by the host
- Also runtime support for texture memory and constant memory

Basic usage

1. Perform any needed allocations
2. Copy data from host to device
3. Invoke kernel
4. Copy results from device to host
5. Clean up allocations

Device memory management

    h_data = malloc(size);
    ... initialize h_data on the host ...
    cudaMalloc((void**) &d_data, size);
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
    ... invoke kernel ...
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    free(h_data);

Notes:
- Don't dereference h_data on the device or d_data on the host!
- Can also copy host-to-host and device-to-device
- Kernel invocation is asynchronous with the CPU; cudaMemcpy is synchronous
  (kernels can be synchronized with cudaThreadSynchronize)
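
One detail the slide omits: the allocation and copy calls return error codes, and it is worth checking them. A minimal sketch of mine (the CHECK macro is my own helper, not part of CUDA):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* Hypothetical helper: abort with a readable message if a CUDA call fails. */
    #define CHECK(call) do {                                        \
        cudaError_t err__ = (call);                                 \
        if (err__ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err__));                     \
            exit(1);                                                \
        }                                                           \
    } while (0)

    /* Usage:
     *   CHECK(cudaMalloc((void**) &d_data, size));
     *   CHECK(cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice));
     */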

CUDA function declarations

    __device__ float device_func();
    __global__ void kernel_func();
    __host__ float host_func();

- __global__ marks a kernel (must return void)
- __device__ functions are called from and executed on the device
- __host__ functions are called from and executed on the host
- __device__ and __host__ can be used together
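
To illustrate the last point (my example, not the slide's), a small helper declared with both qualifiers compiles for both sides, so the same formula can be used in host code and inside a kernel:

    // Hypothetical helper usable from both host and device code.
    __host__ __device__ float sqr(float x) { return x * x; }

    __global__ void square_all(const float* x, float* y, int n)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n)
            y[i] = sqr(x[i]);      // device-side call to the shared helper
    }

    // On the host, the same call sqr(3.0f) works in ordinary CPU code.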

Restrictions on device functions

- No taking the address of a device function
- No recursion
- No static variables inside the function
- No varargs

Kernel invocation

Kernels are called with an execution configuration:

    __global__ void kernel_func(...);
    dim3 dimGrid(100, 50);            // 5000 thread blocks
    dim3 dimBlock(4, 8, 8);           // 256 threads per block
    size_t sharedMemBytes = 64;
    kernel_func<<<dimGrid, dimBlock, sharedMemBytes>>>(...);

- Can write integers (1D layouts) for the first two arguments
- Third argument is optional (defaults to zero)
- Optional fourth argument specifies a stream of execution
  - Used to specify asynchronous execution across kernels
- A kernel can fail if you request too many resources
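
To make the fourth argument concrete (my sketch; kernel_a and kernel_b are hypothetical no-argument kernels), launches on different streams may overlap, and each stream can be synchronized separately:

    // Launch two independent kernels on separate streams so the runtime is
    // free to overlap them.
    dim3 gridA(64), blockA(256), gridB(64), blockB(256);
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    kernel_a<<<gridA, blockA, 0, s0>>>();
    kernel_b<<<gridB, blockB, 0, s1>>>();

    cudaStreamSynchronize(s0);   // wait for work queued on stream s0
    cudaStreamSynchronize(s1);   // wait for work queued on stream s1
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);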

Example: Vector addition

    __global__ void VecAdd(const float* A, const float* B, float* C, int N)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < N)
            C[i] = A[i] + B[i];
    }

    cudaMalloc((void**) &d_A, size);
    cudaMalloc((void**) &d_B, size);
    cudaMalloc((void**) &d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    int threadsPerBlock = 256;
    int blocksPerGrid = (N+255) / threadsPerBlock;   // round up
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);

Shared memory

Size known at compile time:

    __global__ void kernel(...)
    {
        __shared__ float x[256];
        ...
    }
    kernel<<<nb, bs>>>(...);

Size known at kernel launch:

    __global__ void kernel(...)
    {
        extern __shared__ float x[];
        ...
    }
    kernel<<<nb, bs, bytes>>>(...);

Synchronize access with a barrier (__syncthreads).

Example: Butterfly reduction

- On input (step 0): 2^b numbers
- At step i, entry j becomes the sum over all inputs whose indices agree with j in the last b-i bits
- On output (step b): 2^b copies of the sum

Example: Butterfly reduction [diagram for b = 2: entry 00 holds x_00 at step 0, the sum over indices *0 after step 1, and the sum over all indices ** after step 2; entries 01, 10, and 11 evolve analogously]

Example: Butterfly reduction

    __global__ void sum_reduce(int* x)
    {
        // B is a compile-time constant power of 2
        int i = threadIdx.x;
        __shared__ int sum[B];
        sum[i] = x[i];
        __syncthreads();
        for (int bit = B/2; bit > 0; bit /= 2) {
            int inbr = (i + bit) % B;
            int t = sum[i] + sum[inbr];
            __syncthreads();
            sum[i] = t;
            __syncthreads();
        }
    }

    sum_reduce<<<1, N>>>(d_x);

General picture: CUDA extensions

- Type qualifiers: __global__, __device__, __shared__, __local__, __constant__
- Keywords (threadIdx, blockIdx)
- Intrinsics (__syncthreads)
- Runtime API (memory, symbol, execution management)
- Function launch syntax

Libraries and languages

The usual array of language tools exists:
- CUBLAS, CUFFT, CUDA LAPACK bindings (commercial)
- CUDA-accelerated libraries (e.g. in Trilinos)
- Bindings to CUDA from Python, Java, etc.

Hardware picture (G80)

- 128 processors execute threads
- Thread Execution Manager issues threads
- Parallel data cache / shared memory per processor
- All have access to device memory
  - Partitioned into global, constant, and texture spaces
  - Read-only caches for the texture and constant spaces

HW thread organization

- Single Instruction, Multiple Thread (SIMT)
- A warp of threads executes physically in parallel (one warp = 32 parallel threads)
- Blocks are partitioned into warps by consecutive thread ID
- Best efficiency when all threads in a warp do the same operation
- Conditional branches reduce parallelism: all paths taken are executed serially
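
To illustrate divergence (my example, not from the slides): in the first kernel below, threads within a warp take different paths, so the warp executes both paths serially; in the second, the branch condition is uniform across each warp (assuming blockDim.x is a multiple of 32), so nothing is serialized.

    // Divergent: even and odd threads within a warp take different branches,
    // so each warp serially executes both the 'if' and the 'else' path.
    __global__ void divergent(float* x)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i % 2 == 0)
            x[i] = x[i] * 2.0f;
        else
            x[i] = x[i] + 1.0f;
    }

    // Uniform per warp: all 32 threads of a warp share the same branch outcome,
    // so no paths are serialized.
    __global__ void uniform(float* x)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if ((i / 32) % 2 == 0)
            x[i] = x[i] * 2.0f;
        else
            x[i] = x[i] + 1.0f;
    }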

Memory architecture

- Shared memory is divided into 16 banks of 32-bit words
- Each bank services one address per cycle
- Conflicting accesses are serialized
- Stride 1 (or any odd stride): no bank conflicts
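
A sketch of mine (assuming the 16-bank layout above) showing how the stride determines conflicts: with stride 1 each thread of a half-warp hits a distinct bank, while with stride 2 threads t and t+8 map to the same bank and their accesses are serialized.

    // Shared-memory bank behavior (assumes blockDim.x <= 256).
    __global__ void bank_demo(float* out)
    {
        __shared__ float buf[512];
        int t = threadIdx.x;
        buf[t]       = (float) t;        // fill enough of the buffer to read back
        buf[t + 256] = (float) t;
        __syncthreads();

        // Stride-1: thread t reads word t; the 16 threads of a half-warp hit
        // 16 distinct banks (bank = word index mod 16), so no conflicts.
        float a = buf[t];

        // Stride-2: thread t reads word 2*t; threads t and t+8 hit the same
        // bank (2*t mod 16 == 2*(t+8) mod 16), so accesses serialize 2-way.
        float b = buf[2*t];

        out[t] = a + b;
    }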

Batch memory access: coalescing

- Coalescing is a coordinated read by a half-warp
- Reads a contiguous region (64, 128, or 256 bytes)
- The starting address of the region must be a multiple of the region size
- Thread k in the half-warp accesses element k of the region
- Not all threads need to participate
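
A sketch of mine showing the difference this makes: the first kernel reads consecutive floats with consecutive threads, which coalesces into a few wide transactions; the second reads with a large stride, so a half-warp's loads fall in separate regions and cannot be coalesced.

    // Coalesced: thread k of each half-warp reads element k of a contiguous,
    // aligned block of floats.
    __global__ void copy_coalesced(const float* in, float* out, int n)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Not coalesced: consecutive threads read elements 'stride' apart, so the
    // half-warp's addresses span many separate regions.
    __global__ void copy_strided(const float* in, float* out, int n, int stride)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i * stride < n)
            out[i] = in[i * stride];
    }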

The usual picture

- Performance is potentially quite complicated!
  ... and memory is important.
- Fortunately, there are profiling tools included
- Unfortunately, I have yet to play with them!

Resources

Besides the basic NVIDIA documentation, see:
- http://developer.nvidia.com/object/cuda_training.html
- http://courses.ece.illinois.edu/ece498/al/
- http://gpgpu.org/developer