Lecture 15: Introduction to GPU programming


Overview

- Hardware features of GPGPUs
- Principles of GPU programming
- A good reference: David B. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors, Morgan Kaufmann Publishers, 2010.

New trends in microprocessors

- Since 2003, there have been two main trajectories in microprocessor design:
  - Multicore: a relatively small number of cores per chip; each core is a full-fledged processor in the traditional sense
  - Many-core: a large number of much smaller and simpler cores
- The NVIDIA GeForce GTX 280 GPU (graphics processing unit) has 240 cores; each is a heavily multi-threaded, in-order, single-instruction-issue processor. Eight cores share control logic and an instruction cache.
- As of 2009, the peak performance of many-core GPUs was around 10x the peak performance of multicore CPUs
- GPUs also have larger memory bandwidth (simpler memory models and fewer legacy requirements)

About GPUs

- GPU design has been shaped by the video game industry: the ability to perform a massive number of (single-precision) floating-point calculations per video frame
- Full 3D pipeline: transform, lighting, rasterization, texturing, depth testing and display
- Computing on earlier GPU architectures required casting computations as graphics operations
- GeForce 3 in 2001: programmable pixel shading
- Later GeForce products: separate programmable engines for vertex and geometry shading

GPGPU

- General-purpose GPU: capable of performing non-graphics processing
  - running shader code against data presented as vertex or texture information
  - computed results retrieved at a later stage in the pipeline
  - still awkward to program from the CPU perspective
- Unified shader architecture: each shader core can be assigned any shader task, so no stage-by-stage balancing is needed
  - GeForce 8800 in 2006 (128 processing elements distributed among 8 shader cores)
- Tesla product line: graphics cards without display outputs, with drivers optimized for GPU computing instead of 3D rendering

NVIDIA's Fermi architecture

- Designed for GPU computing (graphics-specific bits largely omitted)
- 16 streaming multiprocessors (SMs)
- 32 CUDA cores per SM (512 cores in total)
- Each core executes one floating-point or integer instruction per clock
- Six 64-bit DRAM interfaces
- GigaThread scheduler
- Peak double-precision floating-point rate: 768 GFLOPS

CUDA

- CUDA: Compute Unified Device Architecture
- A C-based programming model for GPUs
- Introduced together with the GeForce 8800
- Joint CPU/GPU execution (host/device)
- A CUDA program consists of one or more phases that are executed on either the host or the device
- The user needs to manage data transfer between CPU and GPU
- A CUDA program is a unified source code encompassing both host and device code

Programming GPUs for computing (1)

- Kernel functions: the computational tasks on the GPU
- An application or library function may consist of one or more kernels
- Kernels can be written in C, extended with additional keywords to express parallelism
- Once compiled, a kernel consists of many threads that execute the same program in parallel
- Multiple threads are grouped into thread blocks
  - All threads in a thread block run on a single SM
  - Within a thread block, threads cooperate and share memory
  - A thread block is divided into warps of 32 threads
  - The warp is the fundamental unit of dispatch within an SM
- Thread blocks may execute in any order

Programming GPUs for computing (2)

- Thread blocks are grouped into grids; each grid executes a unique kernel
- Thread blocks and threads have IDs that specify their relationship to the kernel
- Multiple kernels can execute simultaneously, each assigned to one or more SMs
- A centralized scheduler
- When a kernel is invoked, it is executed as a grid of parallel threads
- Threads in a grid are organized in a two-level hierarchy:
  - each grid consists of one or more thread blocks; all blocks in a grid have the same number of threads
  - each block has unique 2D coordinates blockIdx.x and blockIdx.y
  - each thread block is organized as a 3D array of threads: threadIdx.x, threadIdx.y, threadIdx.z
- The grid and thread block dimensions are set when a kernel is invoked

Simple example of a CUDA program

The device is used to square each element of an array.

    // Kernel that executes on the CUDA device
    __global__ void square_array(float *a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] = a[idx] * a[idx];
    }

Simple example of a CUDA program (cont'd)

    // main routine that executes on the host
    int main(void)
    {
        float *a_h, *a_d;               // pointers to host & device arrays
        const int N = 10;               // number of elements in the arrays
        size_t size = N * sizeof(float);
        a_h = (float *)malloc(size);            // allocate array on host
        cudaMalloc((void **)&a_d, size);        // allocate array on device
        // Initialize host array and copy it to the CUDA device
        for (int i = 0; i < N; i++)
            a_h[i] = (float)i;
        cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
        // Do calculation on device
        int block_size = 4;
        int n_blocks = N/block_size + (N % block_size == 0 ? 0 : 1);
        square_array<<<n_blocks, block_size>>>(a_d, N);
        // Retrieve result from device and store it in host array
        cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
        // Print results
        for (int i = 0; i < N; i++)
            printf("%d %f\n", i, a_h[i]);
        // Cleanup
        free(a_h);
        cudaFree(a_d);
    }

CPU computing vs. GPU computing

- CPUs are good for applications where most of the work is done by a small number of threads, the threads have high data locality, and there is a mixture of different operations and conditional branches
- GPUs aim at the other end of the spectrum: data parallelism
  - many arithmetic operations performed on data structures simultaneously
  - applications with many threads, dominated by long sequences of computational instructions
  - computationally intensive (not control-flow intensive)
- GPU computing will not replace CPU computing