CUDA Architecture & Programming Model

Course on Multi-core Architectures & Programming
Oliver Taubmann
May 9, 2012

Outline
Introduction
Architecture
  Generation Fermi
  A Brief Look Back At Tesla
  What's New With Kepler?
Programming
  Programming Model
  Software Framework
  Example Code

Introduction

Motivation: GPU vs. CPU


The Rise Of GPGPU
Early 2000s: programmable shaders enable general-purpose computing on GPUs. But: intimate knowledge of the graphics pipeline and its APIs was required, and GPUs were powerful yet inflexible.
A unified processor architecture was needed for both graphics and computing (G80). Since 2007: CUDA (Compute Unified Device Architecture) lets you program GPUs intuitively with (extended) C.

Architecture

Generation Fermi

Fermi Architecture Overview
16 streaming multiprocessors (SMs), 512 cores in total.
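
The SM count of the installed GPU can be queried at runtime. A minimal sketch (not part of the original deck) using the CUDA runtime API, assuming device 0 is the Fermi card:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // A full GF100 Fermi chip reports 16 multiprocessors; the number of
    // cores per SM depends on the architecture (32 on Fermi).
    printf("%s: %d SMs, compute capability %d.%d\n",
           prop.name, prop.multiProcessorCount, prop.major, prop.minor);
    return 0;
}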


Fermi's Streaming Multiprocessor
SIMT (single instruction, multiple threads). Hardware threading, no scheduling overhead! Groups of 32 threads (warps) are scheduled together.
Special Function Units (SFUs) for e.g. sin/cos, 1/x, √x. Scalable: just add more SMs!
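
To make the warp notion concrete, here is a small illustrative kernel (not from the slides) that records each thread's warp and lane index within a 1D block; the 32 threads of one warp execute the same instruction together:

__global__ void warpInfoKernel(int *warp_id, int *lane_id)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // warpSize is a built-in variable (32 on current hardware).
    warp_id[tid] = threadIdx.x / warpSize;   // which warp of the block
    lane_id[tid] = threadIdx.x % warpSize;   // position within that warp
}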

Fermi's Memory Hierarchy
64 KB of on-chip memory per SM, shared by the threads of a block (configurable split between shared memory and L1 cache), plus a 768 KB L2 cache.
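
On Fermi the split of those 64 KB between shared memory and L1 cache can be chosen per kernel from host code. A hedged sketch, using an illustrative kernel name (myKernel) that is not part of the deck:

#include <cuda_runtime.h>

__global__ void myKernel(float *data) { /* ... */ }

int main()
{
    // Prefer a 48 KB shared / 16 KB L1 split for this kernel;
    // cudaFuncCachePreferL1 would select 16 KB shared / 48 KB L1 instead.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    // ... allocate buffers and launch myKernel<<<grid, block>>>(...) as usual
    return 0;
}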

A Brief Look Back At Tesla

What Tesla Couldn't Do: Fused Multiply-Add
Fermi computes a*b + c as one fused instruction with a single rounding step, instead of a separate multiply and add.
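
In CUDA C the single-rounding a*b + c is available through the standard math function fmaf; a brief illustration (not from the slides):

__global__ void axpyKernel(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = fmaf(a, x[i], y[i]);   // a * x[i] + y[i] with a single rounding (FMA)
}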

What Else Was Improved Over Tesla
Introduction of L1 and L2 caches
Better double-precision performance
Atomic operations up to 20 times faster
Concurrent kernel execution possible

What's New With Kepler?

Kepler Architecture Overview
1536 cores in total (though running at a lower shader clock rate than Fermi).

Main Focus: Power Efficiency

Three CUDA Generations At A Glance

Programming

Programming Model


Grids, Blocks, Threads
Threads map to cores, blocks map to SMs, and SMs schedule warps. Grids and blocks can have up to 3 dimensions.
Threads in a block communicate through shared memory and synchronize at a barrier (__syncthreads()).
Blocks in a grid communicate through global memory and synchronize only at the end of the kernel.
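
A minimal sketch of that block-level cooperation (illustrative, not part of the deck): every thread stages one element in shared memory, the block synchronizes at __syncthreads(), and each thread then reads a value written by another thread of the same block. It assumes blocks of exactly BLOCK threads and an array length that is a multiple of BLOCK.

#define BLOCK 128

__global__ void reverseBlockKernel(float *data)
{
    __shared__ float tile[BLOCK];        // visible to all threads of this block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];

    __syncthreads();                     // barrier: all writes to tile have finished

    data[i] = tile[blockDim.x - 1 - threadIdx.x];   // reverse within the block
}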

Automatic Scalability
Blocks are distributed across however many SMs a particular GPU provides, so the same grid runs unchanged on small and large devices.

Software Framework

Software Stack
(Libraries include CUBLAS, CUFFT, Thrust (STL-like), ...)


Runtime Library And Built-ins
Types / functions:
Vector types: int2, dim3 (uint3), float4, ...
Math functions: sinf, powf, min, ...
Atomic functions: atomicAdd(), atomicMax(), ...
Memory management: cudaMalloc(), cudaMemcpy(), ...
Synchronization: __syncthreads()
Built-in variables:
dim3 threadIdx: position within the block
dim3 blockIdx: position within the grid
dim3 blockDim: block size in threads
dim3 gridDim: grid size in blocks
int warpSize: number of threads per warp
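
As a small combined example (not from the slides), a histogram kernel that uses the built-in index variables together with atomicAdd(), which resolves concurrent increments from different threads and blocks:

__global__ void histogramKernel(const unsigned char *in, size_t n, unsigned int *bins)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[in[i]], 1u);   // safe concurrent increment of bins[0..255]
}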

Qualifiers
Function type qualifiers:
__global__ : kernel, called from host, executed on device
__device__ : function called from device, executed on device
__host__ : function called from host, executed on host (optional)
Variable type qualifiers:
__device__ : global, accessible by device and host (optional)
__constant__ : constant, accessible by device (read only) and host
__shared__ : shared, life span and access tied to block
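
A sketch of how the qualifiers combine in practice (illustrative names, not part of the deck); the host would fill coeffs with cudaMemcpyToSymbol and launch with at most 256 threads per block:

__constant__ float coeffs[4];            // read-only on the device, written by the host

__device__ float poly(float x)           // callable from device code only
{
    return ((coeffs[3] * x + coeffs[2]) * x + coeffs[1]) * x + coeffs[0];
}

__global__ void polyKernel(const float *in, float *out, int n)   // kernel entry point
{
    __shared__ float cache[256];         // one copy per block, lives as long as the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cache[threadIdx.x] = in[i];
        out[i] = poly(cache[threadIdx.x]);
    }
}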

Compilation
NVCC separates the serial (host) and parallel (device) parts of the source. Device code is compiled to the pseudo-assembly PTX (Parallel Thread Execution). Finally, everything is linked into one executable.

Example Code

minibrot.cu

Host Code

size_t n = 30;      // side length of canvas
size_t block = 5;   // side length of a block

dim3 blockDim(block, block);
dim3 gridDim((n / block) + 1, (n / block) + 1);

char *arr_gpu;
cudaMalloc(&arr_gpu, n * n * sizeof(char));

mandelbrotKernel<<<gridDim, blockDim>>>(arr_gpu, n);

cudaDeviceSynchronize();  // wait for the kernel to finish

char *arr = (char *) malloc(n * n * sizeof(char));

cudaMemcpy(arr, arr_gpu, n * n * sizeof(char), cudaMemcpyDeviceToHost);

printMatrix(arr, n);

free(arr);
cudaFree(arr_gpu);

Device Code

__global__ void mandelbrotKernel(char *arr, size_t n)
{
    uint2 idx;  // position on canvas
    idx.x = blockIdx.x * blockDim.x + threadIdx.x;
    idx.y = blockIdx.y * blockDim.y + threadIdx.y;

    if (!(idx.x < n && idx.y < n)) return;

    float2 z = make_float2(0.0f, 0.0f);
    float2 c = make_float2(-1.0f + 2.0f * (float(idx.x) / n),
                           -1.0f + 2.0f * (float(idx.y) / n));

    int iter = 0;
    int maxIter = 100;

    for (; iter < maxIter && (z.x * z.x + z.y * z.y) < 2.0f; ++iter)
        z = make_float2(z.x * z.x - z.y * z.y + c.x, 2.0f * z.x * z.y + c.y);

    arr[idx.x * n + idx.y] = (iter == maxIter) ? '#' : ' ';
}

Sources
Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, 28(2):39-55, March 2008.
NVIDIA Corporation. NVIDIA's Next Generation CUDA Compute Architecture: Fermi. 2009.
NVIDIA Corporation. NVIDIA GeForce GTX 680: The fastest, most efficient GPU ever built. 2012.
NVIDIA Corporation. NVIDIA CUDA C Programming Guide, v4.2. 2012.
(Plus slides from talks given in this course in previous years.)

Thanks... for your attention! Questions?
(Image from: gizmodo.com.au/2009/05/giz_explains_gpgpu_computing_and_why_itll_melt_your_face_off-2)