CS179 GPU Programming: Introduction to CUDA. Lecture originally by Luke Durant and Tamas Szalay

Introduction to CUDA. Lecture originally by Luke Durant and Tamas Szalay

Today: CUDA
- Why CUDA?
- Overview of CUDA architecture
- Dense matrix multiplication with CUDA

Shader GPGPU
- Before the current generation, this was all we had
- Lots of things are faster in GLSL than on the CPU, but:
  - No scatter!
  - No communication between fragments (threads)
  - Awkward interface; requires familiarity with graphics APIs
  - Memory modes are not what we would like
  - Hard to transfer data from GPU to CPU

CUDA
- NVIDIA's solution to GPGPU
- An extension to the C language
- Has been far more popular than CTM/Brook for GPGPU, and is thus the focus of this course
- Still a proprietary environment
  - Keep your eyes on OpenCL
  - DirectX 11 includes a compute shader

CUDA
- Only works on NVIDIA G80 (GeForce 8000 series) and newer cards
- Can run in emulation on other hardware
  - Floating point might not be exactly the same
  - Watch out for OpenGL integration
- Designed to scale well over time

CUDA
- Compute Unified Device Architecture
- A different way of looking at GPU programming
- Provides far more features than we're used to from GL
- Less hassle, more access to the hardware

What is a CUDA Program?
- Two main parts: host and device
- Host code
  - Runs on the CPU, uses special library calls
  - Lives in .cpp or .cu files
- Device code
  - Runs on the GPU, written in C with some extensions
  - Called kernels
  - Lives in .cu files
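
A minimal sketch of these two parts in a single .cu file, compiled with nvcc; the kernel name doubleElements and the sizes are illustrative, not from the lecture:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Device code: the kernel, run by many threads on the GPU.
    __global__ void doubleElements(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
        if (i < n)
            data[i] *= 2.0f;
    }

    // Host code: runs on the CPU and drives the GPU through library calls.
    int main() {
        const int n = 1024;
        float host[n];
        for (int i = 0; i < n; ++i) host[i] = (float)i;

        float *dev;
        cudaMalloc(&dev, n * sizeof(float));
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

        doubleElements<<<n / 256, 256>>>(dev, n);  // launch: 4 blocks of 256 threads

        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev);
        printf("host[3] = %f\n", host[3]);  // prints 6.000000
        return 0;
    }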

Host vs. Device
- Host code
  - Single Program, Single Data
  - Not parallel
  - Typically few threads; threads carry overhead
- Device code
  - Single Program, Multiple Data
  - Parallel
  - Typically thousands of threads, with very little thread-creation overhead

Graphics Mode [diagram: the G80 pipeline in graphics mode - host, input assembler, vertex/geometry/pixel thread issue, setup/raster/ZCull, arrays of streaming processors (SP) with texture fetch (TF) and L1 caches, a thread processor, L2 caches, and framebuffer (FB) partitions]

CUDA Mode [diagram: the same hardware in CUDA mode - host, input assembler, thread execution manager, processor arrays each with a parallel data cache and texture unit, load/store paths, and global memory]

Overview of CUDA architecture

Basic Units of CUDA: The Grid
- A grid is a group of threads all running the same kernel (not synchronized)
- Every CUDA call from the CPU is one grid
- Starting a grid on the CPU is a synchronous operation
  - But multiple grids can run at once
- On multi-GPU systems, grids cannot be shared between GPUs - use several grids for maximum efficiency

Basic Units of CUDA: The Block
- Grids are composed of blocks
- Each block is a logical unit containing a number of coordinating threads and a certain amount of shared memory
- Just as grids are not shared between GPUs, blocks are not shared between multiprocessors

Basic Units of CUDA: The Block
- All blocks in a grid run the same program
- How do you tell which block you are in?
  - The blockIdx built-in variable
- You can set up block IDs to be 1D or 2D (based on the grid dimensions)

Basic Units of CUDA: The Thread
- Blocks are composed of threads
- Threads run on the individual cores of the multiprocessors, but unlike grids and blocks, they are not married to a single core
- Like blocks, each thread has an ID (threadIdx)
- Thread IDs can be 1D, 2D, or 3D (based on the block dimensions)
- A thread's ID is relative to the block it is in (see the index sketch below)
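
For example, with a 2D grid of 2D blocks, a thread combines blockIdx, blockDim, and threadIdx to find the element it owns. This is an illustrative sketch: the kernel name and the 16x16 block shape are assumptions, and width and height are assumed to be multiples of 16:

    __global__ void indexDemo(float *out, int width) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;  // global column index
        int row = blockIdx.y * blockDim.y + threadIdx.y;  // global row index
        out[row * width + col] = (float)(row + col);
    }

    // Host side: dim3 describes multi-dimensional grid/block shapes.
    // dim3 blocks(width / 16, height / 16);
    // dim3 threads(16, 16);
    // indexDemo<<<blocks, threads>>>(devOut, width);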

Overview of CUDA architecture

Thread Storage and Communication
- Threads have a certain amount of register memory
  - Register memory per multiprocessor is limited
- There are several ways of communicating with other threads within the same block
- Outside of the block, there is not a lot of communication
  - Ideally, there should be none

Memory Model

Memory Areas: Globals
- Global memory
  - Main communication between device and host
  - Read/write from both device and host
- Texture memory
  - Read-only from the device
  - Uses 2D hardware caching
- Constant memory
  - Read-only from the device
- Persistent across grid runs
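
A sketch of how these spaces look in code; the names coeffs and applyCoeffs are illustrative, while cudaMemcpyToSymbol is the actual runtime call for filling constant memory from the host:

    __constant__ float coeffs[16];  // constant memory: read-only on the device

    __global__ void applyCoeffs(float *data, int n) {  // "data" lives in global memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= coeffs[i % 16];  // every thread reads the same small table
    }

    // Host side:
    // float h[16] = { /* ... */ };
    // cudaMemcpyToSymbol(coeffs, h, sizeof(h));  // constant memory is written from the host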

Memory Areas: Per-Block
- Shared memory
  - Accessible by all threads within a block
  - Physically stored in each multiprocessor

Memory Areas: Per-Thread
- Register memory
  - Efficient
  - Can't be indexed (no arrays!)
  - Very limited space
- Local memory
  - More space
  - More access modes
- Both are limited by the physical memory in each multiprocessor
- Unless you say otherwise, the compiler will try to put things into register memory
  - It is easy to run out of register memory

Memory Synchronization
- All memory accesses are thread-safe in the sense that each individual access will complete (it is atomic), although the order is undefined without explicit synchronization
- If two threads write the same location at the same time, one will win
- Try to avoid these situations: no two threads should write the same location at the same time
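
When several threads genuinely must update the same location, CUDA also provides atomic intrinsics that make the whole read-modify-write indivisible. A sketch (the histogram kernel is illustrative; atomicAdd on global integers requires compute capability 1.1 or later, so it is not available on the original G80):

    __global__ void histogram(const unsigned char *values, int *bins, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[values[i]], 1);  // safe even when threads collide on a bin
    }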

Memory Areas: Comparison
- We have a lot more memory tools than GLSL:
  - Shared memory effectively allows communication between threads
  - Global memory offers read/write access to memory areas directly accessible from the CPU
  - Allowed memory formats aren't restricted by graphics APIs: we don't have to stuff data into textures; we just get a void*!

Synchronization
- Basic unit of synchronization: __syncthreads()
  - Waits until all threads in the block reach the call to __syncthreads()
  - Be careful using it in conditionals! Some threads may not reach it - deadlock!
  - OK if the conditional only depends on blockIdx
- No easy way of synchronizing between blocks
  - Be careful of having blocks write to the same area of global memory
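
A sketch of the usual pattern: __syncthreads() separates a phase where every thread writes shared memory from a phase where threads read each other's results. The kernel name is illustrative, and it assumes blocks of 256 threads and an array length that is a multiple of 256:

    __global__ void reverseInBlock(float *data) {
        __shared__ float tile[256];  // one tile per block, visible to all its threads
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = data[i];  // phase 1: each thread writes one slot
        __syncthreads();              // all writes finished before anyone reads

        data[i] = tile[blockDim.x - 1 - threadIdx.x];  // phase 2: read another thread's slot
    }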

What did we really get?
- A cleaner interface (no more ARB, EXT, etc.)
- No more graphics APIs, opening windows, messing with textures, etc.
- Scatter!
- Read/write memory (no more ping-ponging)
- Synchronization
- Easy library calls to share data with the CPU
- Now that we have scatter and synchronization, the whole literature of parallel algorithms is easy to implement

CUDA Example
- Now, let's use our new knowledge to figure out how to do a familiar problem efficiently in CUDA
- Matrix multiplication: Y = A*B
- For simplicity, A and B are square (NxN)

Dense Multiplication #1
- First attempt: use 1 block
- Use NxN threads with 2D indexing
- In each thread, compute the value of the corresponding element of Y in the normal way (a sketch follows)
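
A sketch of attempt #1 (illustrative; since all N*N threads share one block, N is limited by the threads-per-block cap, 512 on G80, so roughly N <= 22):

    __global__ void matMulOneBlock(const float *A, const float *B, float *Y, int N) {
        int row = threadIdx.y;  // 2D indexing within the single block
        int col = threadIdx.x;
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];  // every operand comes from global memory
        Y[row * N + col] = sum;
    }

    // Launched with a single block:
    // matMulOneBlock<<<1, dim3(N, N)>>>(devA, devB, devY, N);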

What's wrong with attempt #1?
- Biggest problem: 1 thread block means only 1 multiprocessor is in use
  - On the GTX 280, this means we're only using about 3% of our processing power!
- Also, global memory accesses are out of control
  - Each source element is read N times from global memory!
  - We'll fix this later

Dense Multiplication #2
- Use N blocks, with N threads per block
- Each thread calculates the element determined by its block and thread index (a sketch follows)
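
A sketch of attempt #2 (illustrative; one block per row and one thread per column, so N must still fit within the threads-per-block limit):

    __global__ void matMulRowPerBlock(const float *A, const float *B, float *Y, int N) {
        int row = blockIdx.x;   // the block index picks the row
        int col = threadIdx.x;  // the thread index picks the column
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];  // still all from global memory
        Y[row * N + col] = sum;
    }

    // matMulRowPerBlock<<<N, N>>>(devA, devB, devY, N);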

What's wrong with attempt #2?
- Biggest problem: global memory access
  - Global memory reads are not particularly fast!
- We want to minimize the amount of data read from global memory
- Key idea: each thread in the block is accessing the same data
  - We can move that data into shared memory, which is much faster

Dense Multiplication #3
- Use N blocks, with N threads per block
- Before any calculation, copy the row corresponding to each block into shared memory
- Now we've drastically cut down on global memory accesses (a sketch follows)
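
A sketch of attempt #3 (illustrative; it assumes one row of A fits in shared memory, sized via the dynamic shared-memory amount passed as the third launch parameter):

    __global__ void matMulSharedRow(const float *A, const float *B, float *Y, int N) {
        extern __shared__ float rowA[];  // N floats, sized at launch time
        int row = blockIdx.x;
        int col = threadIdx.x;

        rowA[col] = A[row * N + col];  // the block cooperatively loads its row once
        __syncthreads();               // the whole row is in place before any use

        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += rowA[k] * B[k * N + col];  // A is now read from fast shared memory
        Y[row * N + col] = sum;
    }

    // matMulSharedRow<<<N, N, N * sizeof(float)>>>(devA, devB, devY, N);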

Potential Problems with #3
- Shared memory probably isn't large enough to store a whole row
- Solution: store as much as we can at one time, then grab more
  - Problem with this: we would need to sync threads between each memory access; otherwise we may produce the wrong answer

Dense Multiplication
- There is a trade-off to be made here
  - If we use fewer blocks, in general, memory accesses are less common
  - If we use more blocks, there can be more dead time in the processors
- You have to experiment to find the best balance
- In general, as the problem size increases, fewer blocks will give better performance for a while, then suddenly become terrible

Dense Multiplication
- Other factors matter as well
- Global vs. texture memory?
  - Memory caching: texture memory is heavily cached

How to Learn CUDA in 3 Steps
1. Read the CUDA Programming Guide
2. Look at the SDK
3. Repeat steps 1 and 2 for a while