Threading Hardware in G80

Threading Hardware in G80 1

Sources: slides by ECE 498 AL: Programming Massively Parallel Processors (Wen-Mei Hwu); John Nickolls, NVIDIA. 2

(Diagram, fixed-function pipeline: the 3D application or game sends 3D API commands through the 3D API (OpenGL or Direct3D) across the CPU-GPU boundary (AGP/PCIe) as a GPU command & data stream; the GPU front end emits a vertex index stream to primitive assembly; assembled primitives go to rasterization and interpolation; the pixel location stream goes to raster operations; pixel updates land in the frame buffer; pre-transformed vertices and fragments pass through the vertex and fragment processing stages.) 3

(Diagram, programmable pipeline: the same stages, with pre-transformed vertices handled by a programmable vertex processor and pre-transformed fragments by a programmable fragment processor.) 4

(Diagram, unified programmable pipeline: the same stages, with vertices and fragments handled by a single unified vertex, fragment, and geometry processor.) 5

General Diagram (6800/NV40) 6

TurboCache. Uses PCI-Express bandwidth to render directly to system memory, so the card needs less memory: a performance boost while lowering cost. The TurboCache Manager dynamically allocates from main memory; local memory is used to cache data and to deliver peak performance when needed. 7

NV40 Vertex Processor. An NV40 vertex processor can execute one vector operation (up to four FP32 components), one scalar FP32 operation, and one texture access per clock cycle. 8

NV40 Fragment Processors. Early termination from mini-z-buffer and z-buffer checks; the resulting sets of 4 pixels (quads) are passed on to the fragment units. 9

Why the NV40 series was better: massive parallelism; scalability (lower-end products have fewer pixel pipes and fewer vertex shader units); computation power (222 million transistors); first to comply with Microsoft's DirectX 9 spec; dynamic branching in pixel shaders. 10

Dynamic Branching. Helps detect whether a pixel needs shading; instruction flow is handled in groups of pixels. Specify the branch granularity (the number of consecutive pixels that take the same branch) for better distribution of blocks of pixels between the different quad engines. 11

General Diagram (7800/G70) 12

General Diagram (7800/G70) General Diagram (6800/NV40) 13

GeForce Go 7800 Power Issues. Power consumption and package are the same as the 6800 Ultra chip, meaning notebook designers do not have to change very much about their thermal designs. Dynamic clock scaling can run as slow as 16 MHz; this is true for the engine, memory, and pixel clocks. Heavier use of clock gating than the desktop version. Runs at voltages lower than any other mobile performance part. Regardless, you won't get much battery-based runtime for a 3D game. 15

GeForce 7800 GTX Parallelism. (Diagram: 8 vertex engines; Z-cull; triangle setup/raster; shader instruction dispatch; 24 pixel shaders; fragment crossbar; 16 raster operation pipelines; four memory partitions.) 16

G80 Graphics Mode. The future of GPUs is programmable processing, so build the architecture around the processor. (Diagram: host, input assembler, setup/raster/z-cull, vertex/geometry/pixel thread issue, an array of processors with texture fetch (TF) units and L1 caches, L2 caches, and frame buffer (FB) partitions.) 17

G80 CUDA Mode: A Device Example. Processors execute computing threads; a new operating mode/HW interface for computing. (Diagram: host, input assembler, thread execution manager, an array of processors with parallel data caches and texture units, load/store units, and global memory.) 18

Why Use the GPU for Computing? The GPU has evolved into a very flexible and powerful processor: it's programmable using high-level languages, it supports 32-bit floating-point precision, it offers lots of GFLOPS, and there is a GPU in every PC and workstation. (Chart: GFLOPS over time for G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800.) 19

What is Behind such an Evolution? The GPU is specialized for compute-intensive, highly data-parallel computation (exactly what graphics rendering is about), so more transistors can be devoted to data processing rather than data caching and flow control. (Diagram: the CPU devotes die area to control and cache alongside a few ALUs and DRAM; the GPU devotes it to many ALUs and DRAM.) The fast-growing video game industry exerts strong economic pressure that forces constant innovation. 20

What is (Historical) GPGPU? General-purpose computation using the GPU and graphics API in applications other than 3D graphics; the GPU accelerates the critical path of the application. Data-parallel algorithms leverage GPU attributes: large data arrays, streaming throughput, fine-grain SIMD parallelism, low-latency floating-point (FP) computation. Applications (see GPGPU.org): game effects (FX) physics, image processing, physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting. 21

Previous GPGPU Constraints. Dealing with the graphics API: working with the corner cases of the graphics API. Addressing modes: limited texture size/dimension. Shader capabilities: limited outputs. Instruction sets: lack of integer & bit ops. Communication limited: between pixels, scatter a[i] = p. (Diagram of the fragment program model: input registers, temp registers, output registers, texture, constants, and FB, with per-thread, per-shader, and per-context scopes.) 22

An Example of Physical Reality Behind CUDA. (Photo: a CPU (host) next to a GPU with local DRAM (device).) 23

Arrays of Parallel Threads. A CUDA kernel is executed by an array of threads. All threads run the same code (SPMD). Each thread has an ID that it uses to compute memory addresses and make control decisions. (Diagram: threads with threadID 0..7, each executing float x = input[threadID]; float y = func(x); output[threadID] = y;) 24
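
A minimal sketch of this pattern as a complete CUDA kernel; the kernel name and the stand-in computation for func are illustrative assumptions, and the global index combines the block and thread IDs introduced on the following slides:

__global__ void apply_func(const float *input, float *output, int n)
{
    // Each thread handles one element, selected by its global thread ID
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < n) {
        float x = input[threadID];
        float y = 2.0f * x + 1.0f;   // stand-in for func(x)
        output[threadID] = y;
    }
}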

Thread Blocks: Scalable Cooperation. Divide the monolithic thread array into multiple blocks. Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization; threads in different blocks cannot cooperate. (Diagram: thread blocks 0 through N-1, each with threads 0..7 executing float x = input[threadID]; float y = func(x); output[threadID] = y;) 25

Thread Batching: Grids and Blocks. A kernel is executed as a grid of thread blocks; all threads share the data memory space. A thread block is a batch of threads that can cooperate with each other by synchronizing their execution (for hazard-free shared memory accesses) and by efficiently sharing data through a low-latency shared memory. Two threads from two different blocks cannot cooperate. (Diagram: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2D array of blocks, each block a 2D array of threads. Courtesy: NVIDIA.) 26
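
A sketch of the host-side launch that batches threads into a grid of blocks, reusing the hypothetical apply_func kernel sketched above (sizes are illustrative; d_input and d_output are assumed device pointers):

void launch_apply_func(const float *d_input, float *d_output, int n)
{
    int threadsPerBlock = 256;                                        // 1 to 512 threads per block on G80
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // enough blocks to cover n elements
    apply_func<<<blocksPerGrid, threadsPerBlock>>>(d_input, d_output, n);
}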

Block and Thread IDs. Threads and blocks have IDs, so each thread can decide what data to work on. Block ID: 1D or 2D; Thread ID: 1D, 2D, or 3D. This simplifies memory addressing when processing multidimensional data, e.g. image processing or solving PDEs on volumes. (Diagram: a 2D grid of blocks, each block a 2D array of threads. Courtesy: NVIDIA.) 27
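
A sketch of 2D block and thread IDs used for image processing; the kernel, the image layout, and the tile size are illustrative assumptions:

__global__ void invert_image(const unsigned char *in, unsigned char *out,
                             int width, int height)
{
    // 2D block ID and 2D thread ID combine into pixel coordinates
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = 255 - in[y * width + x];
}

void launch_invert(const unsigned char *d_in, unsigned char *d_out,
                   int width, int height)
{
    dim3 block(16, 16);                                   // 256 threads per block, arranged 16x16
    dim3 grid((width  + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);          // one thread per pixel
    invert_image<<<grid, block>>>(d_in, d_out, width, height);
}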

CUDA Device Memory Space Overview. Each thread can: R/W per-thread registers; R/W per-thread local memory; R/W per-block shared memory; R/W per-grid global memory; read only per-grid constant memory; read only per-grid texture memory. The host can R/W global, constant, and texture memories. (Diagram: a grid of blocks, each block with shared memory and per-thread registers and local memory, above the global, constant, and texture memories shared with the host.) 28
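
A sketch of how these memory spaces appear in CUDA C source; the variable and kernel names are illustrative assumptions (launch with 256-thread blocks covering the arrays):

__constant__ float scale;      // per-grid constant memory: read-only in kernels, set by the host
__device__   float bias;       // per-grid global memory, visible to all threads

__global__ void scale_kernel(const float *gdata, float *gout)   // pointers into global memory
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // i lives in a per-thread register
    __shared__ float tile[256];                      // per-block shared memory
    tile[threadIdx.x] = gdata[i] * scale + bias;     // R/W global, read constant
    __syncthreads();
    gout[i] = tile[threadIdx.x];
}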

Global, Constant, and Texture Memories (Long-Latency Accesses). Global memory: the main means of communicating R/W data between host and device; contents visible to all threads. Texture and constant memories: constants initialized by the host; contents visible to all threads. (Diagram as on the previous slide. Courtesy: NVIDIA.) 29
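
A sketch of the host side reading and writing these spaces with the CUDA runtime; it assumes the __constant__ symbol scale from the previous sketch, and the array sizes are illustrative:

int main(void)
{
    float h_in[256] = {0}, h_out[256];
    float *d_in, *d_out;

    cudaMalloc(&d_in,  sizeof(h_in));                               // allocate global memory
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);   // host write to global memory

    float h_scale = 0.5f;
    cudaMemcpyToSymbol(scale, &h_scale, sizeof(float));             // host initializes constant memory

    // ... launch kernels that read d_in and scale, and write d_out ...

    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost); // host read from global memory
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}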

Block IDs and Thread IDs. Each thread uses IDs to decide what data to work on. Block ID: 1D or 2D; Thread ID: 1D, 2D, or 3D. This simplifies memory addressing when processing multidimensional data, e.g. image processing or solving PDEs on volumes. (Diagram: the host launches kernels on grids of 2D blocks; each block is a 3D array of threads. Courtesy: NVIDIA.) 30

CUDA Memory Model Overview. Global memory: the main means of communicating R/W data between host and device; contents visible to all threads; long-latency access. We will focus on global memory for now; constant and texture memory will come later. (Diagram: a grid of blocks, each with shared memory and per-thread registers, above the global memory shared with the host.) 31

Parallel Computing on a GPU. 8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications, available in laptops, desktops, and clusters. GPU parallelism is doubling every year, and the programming model scales transparently. Programmable in C with CUDA tools; the multithreaded SPMD model uses application data parallelism and thread parallelism. (Photos: GeForce 8800, Tesla D870, Tesla S870.) 32

Single-Program Multiple-Data (SPMD). A CUDA integrated CPU + GPU application C program: serial C code executes on the CPU; parallel kernel C code executes on GPU thread blocks. (Diagram: CPU serial code, then GPU parallel kernel KernelA<<< nblk, ntid >>>(args) on Grid 0, then more CPU serial code, then KernelB<<< nblk, ntid >>>(args) on Grid 1.) 33

Grids and Blocks. A kernel is executed as a grid of thread blocks; all threads share the global memory space. A thread block is a batch of threads that can cooperate with each other by synchronizing their execution using a barrier and by efficiently sharing data through a low-latency shared memory. Two threads from two different blocks cannot cooperate. (Diagram as on slide 30. Courtesy: NVIDIA.) 34
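
A sketch of in-block cooperation through the low-latency shared memory and the barrier; the kernel (reversing elements within each 256-thread block) is an illustrative assumption:

__global__ void reverse_within_block(float *data)
{
    __shared__ float buf[256];                   // per-block shared memory
    int i = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    buf[i] = data[base + i];                     // each thread stages one element
    __syncthreads();                             // barrier: all writes visible before any reads
    data[base + i] = buf[blockDim.x - 1 - i];    // read an element staged by another thread
}

Threads in two different blocks never touch the same buf, which is why no cross-block synchronization is needed (or possible).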

CUDA Thread Block. The programmer declares the block: block size 1 to 512 concurrent threads; block shape 1D, 2D, or 3D; block dimensions in threads. All threads in a block execute the same thread program. Threads share data and synchronize while doing their share of the work. Threads have thread id numbers within the block; the thread program uses the thread id to select work and address shared data. (Diagram: thread ids 0, 1, 2, 3, ..., m running the thread program. Courtesy: John Nickolls, NVIDIA.) 35

GeForce-8 Series HW Overview. (Diagram: the Streaming Processor Array (SPA) is built from Texture Processor Clusters (TPCs); each TPC contains two Streaming Multiprocessors (SMs) and a TEX unit; each SM has an instruction L1 cache, a data L1 cache, instruction fetch/dispatch, shared memory, SPs, and SFUs.) 36

CUDA Processor Terminology. SPA: Streaming Processor Array (variable across the GeForce 8-series, 8 in GeForce 8800). TPC: Texture Processor Cluster (2 SM + TEX). SM: Streaming Multiprocessor (8 SP); a multi-threaded processor core; the fundamental processing unit for a CUDA thread block. SP: Streaming Processor; a scalar ALU for a single CUDA thread. 37

Streaming Multiprocessor (SM). 8 Streaming Processors (SP); 2 Super Function Units (SFU). Multi-threaded instruction dispatch: 1 to 512 threads active; shared instruction fetch per 32 threads; covers the latency of texture/memory loads. 20+ GFLOPS; 16 KB shared memory; texture and global memory access. (Diagram: an SM with instruction L1, data L1, instruction fetch/dispatch, shared memory, SPs, and SFUs.) 38

G80 Computing Pipeline. Processors execute computing threads: an alternative operating mode specifically for computing, building the architecture around the processor. (Diagram: host, input assembler that generates grids based on kernel calls, thread execution manager, an array of processors with parallel data caches and texture L1 caches, load/store units and L2 caches, frame buffer partitions, and global memory.) 39

Thread Life Cycle in HW. The grid is launched on the SPA. Blocks are serially distributed to all the SMs, potentially >1 block per SM. Each SM launches warps of threads: 2 levels of parallelism. The SM schedules and executes warps that are ready to run. As warps and blocks complete, resources are freed and the SPA can distribute more blocks. (Diagram: the host launches Kernel 1 and Kernel 2 on grids of 2D blocks, as on slide 26.) 40

SM Executes Blocks. Threads are assigned to SMs at block granularity: up to 8 blocks to each SM as resources allow; an SM in G80 can take up to 768 threads, which could be 256 (threads/block) * 3 blocks, or 128 (threads/block) * 6 blocks, etc. Threads run concurrently: the SM assigns/maintains thread id #s and manages/schedules thread execution. (Diagram: SM 0 and SM 1, each with an MT issue unit, SPs, and shared memory, feeding a TF unit with texture L1 and L2 caches; blocks of threads t0..tm assigned to each SM.) 41

Thread Scheduling/Execution. Each block is divided into 32-thread warps: this is an implementation decision, not part of the CUDA programming model. Warps are the scheduling units in an SM. If 3 blocks are assigned to an SM and each has 256 threads, how many warps are there in the SM? Each block is divided into 256/32 = 8 warps, so there are 8 * 3 = 24 warps. At any point in time, only one of the 24 warps will be selected for instruction fetch and execution. (Diagram: Block 1 warps and Block 2 warps, each of threads t0..t31, over an SM with instruction L1, data L1, instruction fetch/dispatch, shared memory, SPs, and SFUs.) 42
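
A small host-side helper that reproduces the warp arithmetic above; the 32-thread warp size is the G80 value from the slide, and the function itself is illustrative:

int warps_per_sm(int threads_per_block, int blocks_per_sm)
{
    const int warp_size = 32;
    int warps_per_block = (threads_per_block + warp_size - 1) / warp_size;  // 256 threads -> 8 warps
    return warps_per_block * blocks_per_sm;                                 // 8 * 3 blocks = 24 warps
}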

SM Warp Scheduling. (Diagram: the SM multithreaded warp scheduler issuing, over time, warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, warp 3 instruction 96.) SM hardware implements zero-overhead warp scheduling: warps whose next instruction has its operands ready for consumption are eligible for execution; eligible warps are selected for execution on a prioritized scheduling policy; all threads in a warp execute the same instruction when selected. 4 clock cycles are needed to dispatch the same instruction for all threads in a warp in G80. If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate 200-cycle memory latency. 43
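
The 13-warp figure follows from the numbers on the slide; a sketch of the arithmetic, assuming one 200-cycle load per 4 independent instructions and 4 dispatch cycles per instruction:

int warps_to_hide_latency(int latency_cycles, int instructions_per_load, int dispatch_cycles)
{
    // Each warp keeps the SM busy for instructions_per_load * dispatch_cycles cycles
    // before it stalls on the load; ceil(latency / that busy time) warps cover the stall.
    int busy_cycles = instructions_per_load * dispatch_cycles;   // 4 * 4 = 16 cycles
    return (latency_cycles + busy_cycles - 1) / busy_cycles;     // ceil(200 / 16) = 13 warps
}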

SM Instruction Buffer and Warp Scheduling. Fetch one warp instruction per cycle from the instruction L1 cache into any instruction buffer slot. Issue one ready-to-go warp instruction per cycle from any warp instruction buffer slot; operand scoreboarding is used to prevent hazards. Issue selection is based on round-robin/age of warp. The SM broadcasts the same instruction to the 32 threads of a warp. (Diagram: I$ L1, multithreaded instruction buffer, register file (RF), C$ L1, shared memory, operand select, MAD and SFU units.) 44

Scoreboarding. All register operands of all instructions in the instruction buffer are scoreboarded: an instruction becomes ready after the needed values are deposited; this prevents hazards; cleared instructions are eligible for issue. Decoupled memory/processor pipelines: any thread can continue to issue instructions until scoreboarding prevents issue; this allows memory/processor ops to proceed in the shadow of other waiting memory/processor ops. 45

Granularity Considerations. For matrix multiplication, should I use 4X4, 8X8, 16X16 or 32X32 tiles? For 4X4, we have 16 threads per block; since each SM can take up to 768 threads, the thread capacity allows 48 blocks, but each SM can only take up to 8 blocks, so there will be only 128 threads in each SM! There are 8 warps, but each warp is only half full. For 8X8, we have 64 threads per block; since each SM can take up to 768 threads, it could take up to 12 blocks, but each SM can only take up to 8 blocks, so only 512 threads will go into each SM. There are 16 warps available for scheduling in each SM; each warp spans four slices in the y dimension. For 16X16, we have 256 threads per block; since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule. There are 24 warps available for scheduling in each SM; each warp spans two slices in the y dimension. For 32X32, we have 1024 threads per block: not even one block can fit into an SM! 46
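
A sketch of the occupancy arithmetic behind these tile choices, using the G80 limits quoted on the slides (768 threads and 8 blocks per SM, at most 512 threads per block); the helper itself is illustrative:

int resident_threads_per_sm(int tile_width)
{
    const int max_threads_per_sm    = 768;
    const int max_blocks_per_sm     = 8;
    const int max_threads_per_block = 512;

    int threads_per_block = tile_width * tile_width;
    if (threads_per_block > max_threads_per_block)
        return 0;                                         // 32x32 = 1024: the block does not fit at all
    int blocks = max_threads_per_sm / threads_per_block;  // thread-capacity limit
    if (blocks > max_blocks_per_sm)
        blocks = max_blocks_per_sm;                       // the 8-block limit bites for 4x4 tiles
    return blocks * threads_per_block;                    // 4x4 -> 128, 8x8 -> 512, 16x16 -> 768
}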

Memory Hardware in G80 47

CUDA Device Memory Space: Review. Each thread can: R/W per-thread registers; R/W per-thread local memory; R/W per-block shared memory; R/W per-grid global memory; read only per-grid constant memory; read only per-grid texture memory. The host can R/W global, constant, and texture memories. (Diagram as on slide 28.) 48

Parallel Memory Sharing. Local memory: per-thread; private per thread; auto variables, register spill. Shared memory: per-block; shared by threads of the same block; inter-thread communication. Global memory: per-application; shared by all threads; inter-grid communication (sequential grids in time). (Diagram: Grid 0 and Grid 1 with per-thread local memory, per-block shared memory, and global memory.) 49

SM Memory Architecture. Threads in a block share data & results in memory and shared memory, and synchronize at the barrier instruction. Per-SM shared memory allocation keeps data close to the processor and minimizes trips to global memory; shared memory is dynamically allocated to blocks and is one of the limiting resources. (Diagram: SM 0 and SM 1, each with an MT issue unit, SPs, and shared memory, feeding a TF unit with texture L1 and L2 caches. Courtesy: John Nickolls, NVIDIA.) 50

SM Register File. Register file (RF): 32 KB (8K entries) for each SM in G80. The TEX pipe can also read/write the RF (2 SMs share 1 TEX), and the load/store pipe can also read/write the RF. (Diagram: I$ L1, multithreaded instruction buffer, RF, C$ L1, shared memory, operand select, MAD and SFU units.) 51

Programmer View of Register File. There are 8192 registers in each SM in G80; this is an implementation decision, not part of CUDA. Registers are dynamically partitioned across all blocks assigned to the SM. Once assigned to a block, a register is NOT accessible by threads in other blocks; each thread in a block only accesses the registers assigned to itself. (Diagram: the register file partitioned among 4 blocks vs. 3 blocks.) 52

Matrix Multiplication Example. If each block has 16X16 threads and each thread uses 10 registers, how many threads can run on each SM? Each block requires 10*256 = 2560 registers, and 8192 = 3 * 2560 + change, so three blocks can run on an SM as far as registers are concerned. What if each thread increases its use of registers by 1? Each block now requires 11*256 = 2816 registers, and 8192 < 2816*3, so only two blocks can run on an SM: a 1/3 reduction of parallelism! 53
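
A sketch of the register-limited block count worked out on this slide, using the 8192-register G80 register file; the helper itself is illustrative:

int register_limited_blocks(int threads_per_block, int regs_per_thread)
{
    const int regs_per_sm = 8192;                          // 8K entries per SM in G80
    int regs_per_block = threads_per_block * regs_per_thread;
    return regs_per_sm / regs_per_block;                   // 256*10 -> 3 blocks, 256*11 -> 2 blocks
}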

More on Dynamic Partitioning. Dynamic partitioning gives more flexibility to compilers/programmers: one can run a smaller number of threads that require many registers each, or a large number of threads that require few registers each. This allows for finer-grain threading than traditional CPU threading models. The compiler can trade off between instruction-level parallelism and thread-level parallelism. 54