CUDA (Compute Unified Device Architecture)

Mike Bailey

History of GPU Performance vs. CPU Performance
[Chart: GFLOPS over time, GPU vs. CPU. Source: NVIDIA. G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]

Why are GPUs Outpacing CPUs?

Due to the nature of graphics computations, GPU chips are customized to handle streaming data. This means that the data is already sequential, or cache-coherent, and thus the GPU chips do not need the significant amount of cache space that dominates CPU chips. The GPU die real estate can then be re-targeted to produce more processing power. For example, while Intel and AMD are now shipping CPU chips with 4 cores, NVIDIA is shipping GPU chips with 128.

Overall, in four years, GPUs have achieved a 17.5-fold increase in performance, a compound annual increase of 2.05X, which exceeds Moore's Law.

What is Cache?

"In computer science, a cache is a collection of data duplicating original values stored elsewhere or computed earlier, where the original data is expensive to fetch (due to longer access time) or to compute, compared to the cost of reading the cache. In other words, a cache is a temporary storage area where frequently accessed data can be stored for rapid access. Once the data is stored in the cache, future use can be made by accessing the cached copy rather than refetching or recomputing the original data, so that the average access time is shorter." -- Wikipedia

Cache, therefore, helps expedite data access that the CPU would otherwise need to fetch from main memory.

How Can You Gain Access to that GPU Power?

1. Write a graphics display program (~1985)
2. Write an application that looks like a graphics display program (~2002)
3. Write in CUDA, which looks like C++ (~2006)

CUDA Architecture

The GPU has some number of MultiProcessors (MPs), depending on the model.
The NVIDIA 8800 comes in 2 models: either 12 or 16 MPs.
The NVIDIA 8600 has 4 MPs.
Each MP has 8 independent processors.
There are 16 KB of Shared Memory per MP, arranged in 16 banks.
There are 64 KB of Constant Memory.
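These numbers vary from one GPU model to the next. As a minimal sketch (not from the original notes), the CUDA runtime API can report them for whatever card is installed; device 0 is assumed here:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main( )
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties( &prop, 0 );            // query device 0

        printf( "Device:                  %s\n", prop.name );
        printf( "MultiProcessors:         %d\n", prop.multiProcessorCount );
        printf( "Shared Memory per Block: %d KB\n", (int)( prop.sharedMemPerBlock / 1024 ) );
        printf( "Constant Memory:         %d KB\n", (int)( prop.totalConstMem / 1024 ) );
        return 0;
    }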

The CUDA Paradigm

[Diagram: a C++ program with CUDA directives in it goes through the compiler and linker, producing a CPU binary and a CUDA binary that runs on the GPU]

CUDA is an NVIDIA-only product, but it is likely that eventually all graphics cards will have something similar.

If GPUs have so Little Cache, how can they Execute General C++ Code Efficiently?

1. Multiple Multiprocessors
2. Threads, lots and lots of threads

CUDA expects you not just to have a few threads, but to have thousands of them! All threads execute the same code (called the kernel), but each operates on different data. Each thread can determine which one it is (see the minimal kernel sketch after this slide).

Think of all the threads as living in a pool, waiting to be executed. All processors start by grabbing a thread from the pool. When a thread gets blocked somehow (a memory access, waiting for information from another thread, etc.), the processor quickly returns the thread to the pool and grabs another one to work on. This thread-swap happens within a single cycle. A full memory access requires 200 instruction cycles to complete.
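A minimal kernel sketch of that idea (the array a and its length n are hypothetical): every thread runs the same code, but each one uses its ID to pick its own data element.

    __global__ void Scale( float *a, float s, int n )
    {
        // each thread determines which one it is...
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // ...and operates only on its own element
        if( i < n )
            a[i] *= s;
    }

    // launched with thousands of threads, far more than there are processors, e.g.:
    //     Scale<<< 1000, 256 >>>( dA, 2.f, 256000 );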

So, the Trick is to Break your Problem into Many, Many Small Pieces

Particle Systems are a great example:
1. Have one thread per particle.
2. Put all of the initial parameters into an array in GPU memory.
3. Tell each thread what the current Time is.
4. Each thread then computes its particle's position, color, etc. and writes it into arrays in GPU memory.
5. The CPU program then initiates drawing of the information in those arrays.
Note: once set up, the data never leaves GPU memory. (See the kernel sketch after this slide.)

(Ben Weiss, CS 519)

Thread Organization: Threads are Arranged in Blocks

A Block has:
Size: 1 to 512 concurrent threads
Shape: 1D, 2D, or 3D (really just a convenience)

Threads have ID numbers within the Block. The program uses these IDs to select work and pull data from memory. Threads share data and synchronize while doing their share of the work.

A Block is a batch of threads that can cooperate with each other by:
Synchronizing their execution
Efficiently sharing data through a low-latency shared memory

Two threads from two different blocks cannot cooperate.
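Here is a minimal sketch of step 4 of the particle example, with one thread per particle. The Particle struct, the time step dT, and the kernel name are assumptions for illustration, not part of the original notes:

    struct Particle
    {
        float x, y, z;          // position
        float vx, vy, vz;       // velocity
    };

    __global__ void UpdateParticles( Particle *p, int numParticles, float dT )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;      // one thread per particle
        if( i >= numParticles )
            return;

        // each thread advances its own particle and writes the result back into GPU memory
        p[i].x += p[i].vx * dT;
        p[i].y += p[i].vy * dT;
        p[i].z += p[i].vz * dT;
    }

Because the particle arrays stay resident in GPU memory, the CPU only launches this kernel each frame and then initiates drawing from those same arrays; no particle data ever crosses back over the bus.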

Thread Organization: Blocks are Arranged in Grids

A CUDA program is organized as a Grid of Blocks.

[Diagram: a Grid of Blocks, Block (0,0) through Block (2,1); Block (1,1) is expanded to show its Threads, (0,0) through (4,2)]

Threads Can Access Various Types of Storage

Each thread has access to:
R/W per-thread registers
R/W per-thread local memory
R/W per-block shared memory
R/W per-grid global memory
Read-only per-grid constant memory
Read-only per-grid texture memory

The CPU can read and write global, constant, and texture memories.

[Diagram: each Block contains Shared Memory plus per-Thread Registers and Local Memory; the Host connects to the Global, Constant, and Texture memories]

Thread Rules

You can have at most 512 Threads per Block.
Threads can share memory with the other Threads in the same Block.
Threads can synchronize with other Threads in the same Block.
Global, Constant, and Texture memory is accessible by all Threads in all Blocks.
Each Thread has registers and local memory.
Each Block can use at most 8,192 registers, divided equally among all Threads.
You can be executing up to 8 Blocks and 768 Threads simultaneously per MP.
A Block is run on only one MP (i.e., it cannot switch to another MP).
A Block can be run on any of the 8 processors of its MP.

Types of CUDA Functions

                                        Executed on the:    Only callable from the:
    __device__ float DeviceFunc()       GPU                 GPU
    __global__ void  KernelFunc()       GPU                 CPU
    __host__   float HostFunc()         CPU                 CPU

__global__ defines a kernel function; it must return void.
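A small sketch of the three qualifiers working together (all names here are made up for illustration):

    __device__ float Brighten( float v )                // runs on the GPU, callable only from GPU code
    {
        return 1.2f * v;
    }

    __global__ void KernelFunc( float *data, int n )    // kernel: runs on the GPU, launched from the CPU
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if( i < n )
            data[i] = Brighten( data[i] );
    }

    __host__ void HostFunc( float *dData, int n )       // ordinary CPU code; __host__ is the default
    {
        KernelFunc<<< ( n + 255 ) / 256, 256 >>>( dData, n );
    }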

Different Types of CUDA Memory

    Memory      Location    Cached           Access       Who
    Local       Off-chip    No               Read/write   One thread
    Shared      On-chip     N/A - resident   Read/write   All threads in a block
    Global      Off-chip    No               Read/write   All threads + CPU
    Constant    Off-chip    Yes              Read         All threads + CPU
    Texture     Off-chip    Yes              Read         All threads + CPU

Types of CUDA Variables

    Declarations:   __global__, __device__, __shared__, __local__, __constant__
    Keywords:       threadIdx, blockIdx
    Intrinsics:     __syncthreads()
    Runtime API:    Memory, symbol, execution management
    CUDA function launch (the <<< >>> syntax in the example below)

    __device__ float filter[N];

    __global__ void convolve( float *image )
    {
        __shared__ float region[M];
        ...
        region[threadIdx.x] = image[i];
        __syncthreads();
        ...
        image[j] = result;
    }

    // Allocate GPU memory
    void *myimage;
    cudaMalloc( &myimage, bytes );

    // 100 blocks, 10 threads per block
    convolve<<< 100, 10 >>>( (float *)myimage );

A CUDA Thread can Query Where it Fits in its Community

    dim3 blockDim;      Dimensions of the block in threads
    dim3 threadIdx;     Thread index within the block
    dim3 gridDim;       Dimensions of the grid in blocks (gridDim.z is not used)
    dim3 blockIdx;      Block index within the grid

The CPU Invokes the CUDA Kernel using a Special Syntax

    CUDAFunction<<< NumBlocks, NumThreadsPerBlock >>>( arg1, arg2, ... );

[Diagram: execution alternates between serial CPU code and parallel GPU kernels:
    CPU Serial Code
    Grid 0: KernelA<<< nBlk, nTid >>>( args );
    CPU Serial Code
    Grid 1: KernelB<<< nBlk, nTid >>>( args );
    ...]
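As a sketch combining the two ideas above, a 2D kernel can compute its own (column, row) from those built-in variables and is launched with the <<< >>> syntax; the image array and its dimensions are assumptions for illustration:

    __global__ void Invert( float *image, int width, int height )
    {
        // where does this thread fit in its community?
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if( col >= width  ||  row >= height )
            return;

        image[ row*width + col ] = 1.f - image[ row*width + col ];
    }

    // host side, using dim3 arguments in the launch:
    //     dim3 threadsPerBlock( 16, 16 );
    //     dim3 numBlocks( ( width + 15 ) / 16, ( height + 15 ) / 16 );
    //     Invert<<< numBlocks, threadsPerBlock >>>( dImage, width, height );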

Rules of Thumb

OpenGL Buffer Objects can be mapped into CUDA memory space.
A CUDA kernel launch is asynchronous; you can call cudaThreadSynchronize( ) from the application to wait for it.
At least 16/12/4 Blocks must be run to fill the device.
The number of Blocks should be at least twice the number of MPs.
The number of Threads per Block should be a multiple of 64.
192 or 256 are good numbers of Threads per Block.
(A launch-configuration sketch following these rules appears after this slide.)

An Example: Recomputing Particle Positions
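A sketch of a launch configuration that follows these rules of thumb, continuing the hypothetical particle kernel from the earlier sketch; the choice of 256 threads per block and the device query are illustrative, not prescribed by the notes:

    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0 );

    int threadsPerBlock = 256;                                      // a multiple of 64
    int numBlocks = ( numParticles + threadsPerBlock - 1 ) / threadsPerBlock;
    if( numBlocks < 2 * prop.multiProcessorCount )                  // at least twice the number of MPs
        numBlocks = 2 * prop.multiProcessorCount;

    UpdateParticles<<< numBlocks, threadsPerBlock >>>( dParticles, numParticles, dT );
    cudaThreadSynchronize( );       // the launch is asynchronous; wait here before using the results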

An Example: Static Image De-noising

An Example: Dynamic Scene Blurring

An Example: Realtime Fluids (512x512 pixels, 76 FPS)