
EE382V (17325): Principles in Computer Architecture: Parallelism and Locality
Fall 2007, Lecture 12: GPU Architecture (NVIDIA G80)
Mattan Erez, The University of Texas at Austin

Outline
- 3D graphics recap and architecture motivation
- Overview and definitions
- Execution core architecture
- Memory architecture (moved to Lecture 13)

Most slides courtesy David Kirk (NVIDIA) and Wen-Mei Hwu (UIUC), from the University of Illinois ECE 498AL class. A few slides courtesy David Luebke (NVIDIA).

A GPU Renders 3D Scenes
A Graphics Processing Unit (GPU) accelerates the rendering of 3D scenes.
- Input: a description of the scene: geometry (triangles), colors, lights, effects, textures
- Output: colored pixels to be displayed on a screen

The NVIDIA GeForce Graphics Pipeline
Host -> Vertex Control -> VS/T&L (vertex cache) -> Triangle Setup -> Raster -> Shader (texture cache) -> ROP -> FBI -> Frame Buffer Memory

Another View of the 3D Graphics Pipeline
Host -> Vertex Control -> VS: Transform -> VS: Geometry -> VS: Lighting -> VS: Setup -> Raster -> FS 0 -> FS 1 -> ...
The vertex stages pass a stream of vertices (N_V ~ 100K); Raster expands it into a stream of fragments (N_F ~ 10M) consumed by the fragment shader stages.

The NVIDIA GeForce Graphics Pipeline (parallel shader units)
Host -> Vertex Control -> [4x Vertex Shader] -> Triangle Setup -> Raster -> [4x Fragment Shader] -> ROP -> FBI -> Frame Buffer Memory

Stream Execution Model
- Data: parallel streams of data
- Processing: kernels (Kernel 1 -> Kernel 2)
- The unit of execution, the processing of one stream element by one kernel, is defined as a thread.

Stream Execution Model (continued)
- Because streams are very long and their elements are independent, the streams can be partitioned into chunks (Kernel 1 -> Kernel 2).
- Chunks are called strips or blocks.
- The unit of execution, the processing of one block of data by one kernel, is defined as a thread block (see the sketch below).
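To make the mapping concrete, here is a minimal CUDA sketch (illustrative, not from the lecture; names are hypothetical): each thread processes one stream element, and the stream is partitioned into thread blocks, the "chunks" above.

    // One kernel, one thread per stream element; the stream is
    // partitioned into 256-thread blocks ("strips"/"chunks").
    __global__ void scaleStream(float *out, const float *in, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
        if (i < n)                                      // guard the last chunk
            out[i] = a * in[i];                         // per-element kernel work
    }

    // Launch: n elements partitioned into blocks of 256 threads.
    // scaleStream<<<(n + 255) / 256, 256>>>(d_out, d_in, 2.0f, n);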

Outline (recap)
- 3D graphics recap and architecture motivation
- Overview and definitions
- Execution core architecture
- Memory architecture

Make the Compute Core the Focus of the Architecture
- The future of GPUs is programmable processing
- So build the architecture around the processor
(Figure: G80 in graphics mode. Host -> Input Assembler -> vertex, geometry, and pixel thread issue -> Setup/Rstr/ZCull -> an array of processors with per-cluster L1 caches, shared L2 caches, and frame buffer (FB) partitions.)

Vertex and Fragment Processing Share Unified Processing Elements
(Figure.)

Vertex and Fragment Processing Share Unified Processing Elements
With separate shader stages, load balancing the hardware is a problem:
- Heavy geometry workload: the pixel shader hardware sits idle (Perf = 4)
- Heavy pixel workload: the vertex shader hardware sits idle (Perf = 8)

Vertex and Fragment Processing Share Unified Processing Elements
With a unified shader, load balancing in software is easier:
- Heavy geometry workload: the unified shader runs mostly vertex work (Perf = 11)
- Heavy pixel workload: the unified shader runs mostly pixel work (Perf = 11)

Vertex and Fragment Processing is Dynamically Load Balanced
(Figure: unified shader usage. Less geometry: high pixel shader use, low vertex shader use. More geometry: balanced use of pixel shader and vertex shader.)

Make the Compute Core the Focus of the Architecture
- The future of GPUs is programmable processing, so build the architecture around the processor
- Processors execute computing threads
- An alternative operating mode specifically for computing
(Figure: G80 in compute mode. Host -> Input Assembler -> Thread Execution Manager -> clusters of processors, each with a parallel data cache and a texture L1; load/store paths through L2 to Global Memory.)

Make the Compute Core the Focus of the Architecture (continued)
- Processors execute computing threads
- An alternative operating mode specifically for computing
- The Thread Execution Manager manages thread blocks; only one kernel runs at a time
(Figure: Host -> Input Assembler -> Thread Execution Manager -> processor clusters with texture units and load/store paths to Global Memory.)

Outline (recap)
- 3D graphics recap and architecture motivation
- Overview and definitions
- Execution core architecture
- Memory architecture

Compute Core
- Host -> Input Assembler -> Thread Execution Manager, which manages thread blocks; only one kernel at a time
- The compute core is organized as TPCs (texture processor clusters), each with texture units and parallel data caches, and load/store paths to Global Memory

GeForce-8 Series HW Overview
(Figure: the Streaming Processor Array (SPA) contains TPCs; each Texture Processor Cluster contains two SMs and a TEX unit; each Streaming Multiprocessor contains an instruction L1, a data L1, instruction fetch/dispatch logic, shared memory, SPs, and SFUs.)

CUDA Processor Terminology
- SPA: Streaming Processor Array; an array of TPCs (8 TPCs in the GeForce 8800)
- TPC: Texture Processor Cluster; a cluster of 2 SMs + 1 TEX (a texture processing unit)
- SM: Streaming Multiprocessor; an array of 8 SPs; a multithreaded processor core; the fundamental processing unit for a thread block
- SP: Streaming Processor; a scalar ALU for a single thread, with 1K registers

Streaming Multiprocessor (SM)
- 8 Streaming Processors (SPs)
- 2 Super Function Units (SFUs)
- Multithreaded instruction dispatch:
  - vectors of 32 threads (warps)
  - up to 16 warps per thread block
  - HW masking of inactive threads in a warp
  - threads cover the latency of texture/memory loads
- 20+ GFLOPS
- 16 KB shared memory
- 32 KB in registers
- DRAM texture and memory access
(Figure: instruction L1, data L1, instruction fetch/dispatch, shared memory, SPs, SFUs.)
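Two quick consistency checks of these numbers (the figure of 8192 32-bit registers per SM is a commonly cited G80 value, assumed here rather than stated on the slide):

\[ 16 \ \text{warps/block} \times 32 \ \text{threads/warp} = 512 \ \text{threads/block} \]
\[ 8192 \ \text{registers} \times 4 \ \text{B/register} = 32 \ \text{KB of registers per SM} \]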

Thread Life Cycle in HW
- A kernel is launched on the SPA; kernels are launched as grids of thread blocks
- Thread blocks are serially distributed to all the SMs: potentially more than one block per SM; at least 96 threads per block
- Each SM launches warps of threads: 2 levels of parallelism
- The SM schedules and executes warps that are ready to run
- As warps and blocks complete, resources are freed, and the SPA can distribute more blocks
(Figure: the host launches Kernel 1 on Grid 1, blocks (0,0) through (2,1), and Kernel 2 on Grid 2; each block contains threads (0,0) through (4,2).)
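As a hedged illustration of this hierarchy, here is a CUDA sketch with dimensions chosen to match Grid 1 in the figure (the kernel name and per-thread work are hypothetical):

    // Illustrative only: Grid 1 from the figure is 3x2 blocks of 5x3 threads.
    __global__ void kernel1(int *data, int width)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // column in the grid
        int y = blockIdx.y * blockDim.y + threadIdx.y;  // row in the grid
        data[y * width + x] = x + y;                    // placeholder work
    }

    void launch(int *d_data)
    {
        dim3 grid(3, 2);   // 3x2 = 6 thread blocks: (0,0) .. (2,1)
        dim3 block(5, 3);  // 5x3 = 15 threads per block: (0,0) .. (4,2)
        kernel1<<<grid, block>>>(d_data, 15);  // SPA distributes the 6 blocks to SMs
        // A second kernel launched here would wait for kernel1 to finish:
        // on G80, only one kernel runs at a time.
    }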

SM Executes Blocks
- Threads are assigned to SMs at block granularity: up to 8 blocks per SM, as resources allow
- An SM in G80 can take up to 768 threads: that could be 256 threads/block * 3 blocks, or 128 threads/block * 6 blocks, etc.
- Threads run concurrently: the SM assigns and maintains thread IDs, and the SM manages and schedules thread execution
(Figure: blocks of threads t0, t1, t2, ..., tm arriving at SM 0 and SM 1, each with an MT IU, SPs, and shared memory; texture L1 and L2 below.)

Thread Scheduling/Execution
- Each block is divided into 32-thread warps (an implementation decision)
- Warps are the scheduling units in an SM
- Example: if 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM? Each block is divided into 256/32 = 8 warps, so there are 8 * 3 = 24 warps. At any point in time, only one of the 24 warps will be selected for instruction fetch and execution.
(Figure: Block 1 warps t0 t1 t2 ... t31, Block 2 warps t0 t1 t2 ... t31; SM with instruction L1, data L1, fetch/dispatch, shared memory, SPs, SFUs.)

SM Warp Scheduling
- SM hardware implements zero-overhead warp scheduling:
  - warps whose next instruction has its operands ready for consumption are eligible for execution
  - all threads in a warp execute the same instruction when selected
  - scoreboard scheduler
- 4 clock cycles are needed to dispatch the same instruction for all threads in a warp in G80
- If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate a 200-cycle memory latency (worked check below)
(Figure: over time, the SM multithreaded warp scheduler issues warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, warp 3 instruction 96.)
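A worked check of that last claim (assuming each warp can issue 4 back-to-back instructions, at 4 cycles apiece, between consecutive memory accesses): each warp keeps the SM busy for

\[ 4 \ \text{instructions} \times 4 \ \text{cycles/instruction} = 16 \ \text{cycles} \]

before it must wait, so hiding a 200-cycle load latency requires

\[ \left\lceil \frac{200 \ \text{cycles}}{16 \ \text{cycles/warp}} \right\rceil = \lceil 12.5 \rceil = 13 \ \text{warps.} \]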

SM Instruction Buffer and Warp Scheduling
- Fetch one warp instruction per cycle: from the instruction L1 cache into any instruction buffer slot
- Issue one ready-to-go warp instruction per cycle: from any warp / instruction-buffer slot; operand scoreboarding is used to prevent hazards; issue selection is based on round-robin/age of warp (modeled below)
- The SM broadcasts the same instruction to the 32 SPs of a warp
(Figure: I$ L1 -> multithreaded instruction buffer -> register file (RF), constant cache (C$ L1), and shared memory -> operand select -> MAD and SFU units.)
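A toy C++ model of that issue-selection policy (an illustration only, not the real hardware): scan the instruction-buffer slots round-robin, starting after the last warp that issued, and pick the first slot whose operands are scoreboarded ready.

    // Toy model (illustrative): round-robin issue among ready warp slots.
    struct Slot { int warp; bool operandsReady; };

    int pickSlot(const Slot buf[], int nSlots, int lastIssued)
    {
        for (int k = 1; k <= nSlots; ++k) {
            int i = (lastIssued + k) % nSlots;  // round-robin order
            if (buf[i].operandsReady)           // scoreboard says no hazard
                return i;                       // issue this warp's instruction
        }
        return -1;                              // no warp eligible: the SM stalls
    }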

Scoreboarding
- All register operands of all instructions in the instruction buffer are scoreboarded:
  - status becomes ready after the needed values are deposited
  - this prevents hazards
  - cleared instructions are eligible for issue
- Decoupled memory/processor pipelines:
  - any thread can continue to issue instructions until scoreboarding prevents issue
  - this allows memory/processor ops to proceed in the shadow of other memory/processor ops

Granularity and Resource Considerations
- For matrix multiplication, should I use 8x8, 16x16, or 32x32 tiles (1 thread per tile element)?
- For 8x8, we have 64 threads per block. Since each SM can take up to 768 threads, it could take up to 12 blocks. However, each SM can take only up to 8 blocks, so only 512 threads will go into each SM.
- For 16x16, we have 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule.
- For 32x32, we have 1024 threads per block. Not even one block can fit into an SM! (See the sketch below.)
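The slide's reasoning can be captured in a small helper, a sketch under G80's limits of 768 threads/SM, 8 blocks/SM, and 512 threads/block (it ignores register and shared memory limits, which can also overrule):

    // Sketch of the slide's arithmetic for G80 (thread/block limits only).
    int residentThreadsPerSM(int threadsPerBlock)
    {
        if (threadsPerBlock > 512) return 0;   // block too big to launch at all
        int blocks = 768 / threadsPerBlock;    // thread-count limit per SM
        if (blocks > 8) blocks = 8;            // block-count limit per SM
        return blocks * threadsPerBlock;
    }

    // residentThreadsPerSM(64)   ->  8 * 64  = 512  (8x8 tiles)
    // residentThreadsPerSM(256)  ->  3 * 256 = 768  (16x16 tiles, full capacity)
    // residentThreadsPerSM(1024) ->  0              (32x32 tiles do not fit)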