From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan-Kelley (Slides by Kayvon Fatahalian)

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan-Kelley (Slides by Kayvon Fatahalian) 1

This talk: three major ideas that make GPU processing cores run fast; a closer look at two real GPU designs (NVIDIA GTX 480 and ATI Radeon 5870); and the GPU memory hierarchy: moving data to processors. 2

Part 1: throughput processing. Three key concepts behind how modern GPU processing cores run code. Knowing these concepts will help you: 1. understand the space of GPU core (and throughput CPU core) designs; 2. understand how GPU cores do (and don't!) differ from CPU cores; 3. optimize shaders/compute kernels; 4. establish intuition: what workloads might benefit from the design of these architectures? 3

What's in a GPU? A GPU is a heterogeneous chip multi-processor (highly tuned for graphics): many shader cores paired with texture (Tex) units, plus fixed-function blocks for input assembly, rasterization, output blend, and video decode, and a work distributor (HW or SW?). 4

A diffuse reflectance shader:

sampler mySamp;
Texture2D<float3> myTex;
float3 lightDir;

float4 diffuseShader(float3 norm, float2 uv)
{
    float3 kd;
    kd = myTex.Sample(mySamp, uv);
    kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
    return float4(kd, 1.0);
}

Shader programming model: fragments are processed independently, but there is no explicit parallel programming. 5
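For readers coming from GPU compute rather than graphics, here is a minimal CUDA sketch of the same computation (an illustration, not the talk's code): the texture fetch is replaced by a hypothetical precomputed texels array, norms holds the interpolated normals, and, exactly as in the shader model, each thread shades one fragment with no explicit parallelism in the source.

#include <cuda_runtime.h>
#include <cstdio>

// One thread = one fragment. texels[i] stands in for the sampled texture
// value kd; norms[i] is the interpolated surface normal for fragment i.
__global__ void diffuseShade(const float3 *texels, const float3 *norms,
                             float3 lightDir, float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;                                   // guard the array tail

    float3 kd = texels[i];
    float3 nrm = norms[i];
    float ndotl = lightDir.x * nrm.x + lightDir.y * nrm.y + lightDir.z * nrm.z;
    ndotl = fminf(fmaxf(ndotl, 0.0f), 1.0f);              // clamp(dot(...), 0, 1)
    out[i] = make_float4(kd.x * ndotl, kd.y * ndotl, kd.z * ndotl, 1.0f);
}

int main()
{
    const int n = 1 << 20;
    float3 *texels, *norms; float4 *out;
    cudaMallocManaged(&texels, n * sizeof(float3));
    cudaMallocManaged(&norms, n * sizeof(float3));
    cudaMallocManaged(&out, n * sizeof(float4));
    for (int i = 0; i < n; ++i) {
        texels[i] = make_float3(0.5f, 0.5f, 0.5f);
        norms[i]  = make_float3(0.0f, 0.0f, 1.0f);
    }
    // Launch one thread per fragment; the hardware groups them for execution.
    diffuseShade<<<(n + 255) / 256, 256>>>(texels, norms,
                                           make_float3(0.3f, 0.4f, 0.9f), out, n);
    cudaDeviceSynchronize();
    printf("out[0] = (%.3f, %.3f, %.3f, %.3f)\n", out[0].x, out[0].y, out[0].z, out[0].w);
    return 0;
}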

Compile shader. The compiled shader consumes 1 unshaded fragment input record and produces 1 shaded fragment output record:

<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0) 6

Execute shader. Fetch/Decode, ALU (Execute), Execution Context: the core fetches, decodes, and executes the compiled instructions one at a time against the fragment's execution context:

<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0) 13

CPU-style cores. Fetch/Decode, ALU (Execute), Execution Context, a data cache (a big one), out-of-order control logic, a fancy branch predictor, and a memory pre-fetcher. 14

Slimming down. Fetch/Decode, ALU (Execute), Execution Context. Idea #1: Remove components that help a single instruction stream run fast. 15

Two cores (two fragments in parallel). Each core has its own Fetch/Decode, ALU (Execute), and Execution Context; fragment 1 and fragment 2 each run their own copy of the <diffuseShader> instruction stream. 16

Four cores (four fragments in parallel), each with its own Fetch/Decode, ALU (Execute), and Execution Context. 17

Sixteen cores (sixteen fragments in parallel) 16 cores = 16 simultaneous instruction streams 18

Instruction stream sharing. But many fragments should be able to share an instruction stream!

<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0) 19

Recall: simple processing core. Fetch/Decode, ALU (Execute), Execution Context. 20

Add ALUs. Fetch/Decode feeding ALU 1 through ALU 8, each with its own context (Ctx), plus Shared Ctx Data. Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs: SIMD processing. 21

Modifying the shader. The original compiled shader processes one fragment using scalar ops on scalar registers. The new compiled shader processes eight fragments using vector ops on vector registers:

<VEC8_diffuseShader>:
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul vec_r3, vec_v0, cb0[0]
VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul vec_o0, vec_r0, vec_r3
VEC8_mul vec_o1, vec_r1, vec_r3
VEC8_mul vec_o2, vec_r2, vec_r3
VEC8_mov o3, l(1.0) 24

128 fragments in parallel 16 cores = 128 ALUs, 16 simultaneous instruction streams 25

128 [vertices / primitives / fragments / OpenCL work items / CUDA threads] in parallel. 26

But what about branches? Time (clocks), ALU 1 ... ALU 8 (eight fragments, lanes 1-8):

<unconditional shader code>
if (x > 0) {
    y = pow(x, exp);
    y *= Ks;
    refl = y + Ka;
} else {
    x = 0;
    refl = Ka;
}
<resume unconditional shader code>

When the eight lanes evaluate the condition differently (e.g., T T F T F F F F), the two sides of the branch are executed one after the other with the non-participating lanes masked off. Not all ALUs do useful work! Worst case: 1/8 peak performance. 30
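The same effect in CUDA terms (a hedged sketch, not from the slides): within a 32-wide warp, lanes that disagree on the branch are masked off while each side runs, so heavily divergent code can fall well below peak throughput. The variable names follow the pseudocode above; launch it like any other kernel with one thread per element.

#include <cuda_runtime.h>

__global__ void branchyShade(const float *x, float *refl,
                             float Ka, float Ks, float e, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float xi = x[i];
    float r;
    // If the 32 threads of a warp disagree here, the hardware executes the
    // "then" side with the "else" lanes masked off, then the "else" side with
    // the "then" lanes masked off: both paths cost time for the whole warp.
    if (xi > 0.0f) {
        float y = powf(xi, e);
        y *= Ks;
        r = y + Ka;
    } else {
        r = Ka;
    }
    refl[i] = r;
}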

Clarification: SIMD processing does not imply SIMD instructions. Option 1: explicit vector instructions (x86 SSE, Intel Larrabee). Option 2: scalar instructions with implicit HW vectorization; the HW determines instruction stream sharing across ALUs, and the amount of sharing is hidden from software (NVIDIA GeForce "SIMT" warps, ATI Radeon "wavefronts"). In practice: 16 to 64 fragments share an instruction stream. 31

Stalls! Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation. Texture access latency = 100s to 1000s of cycles, and we've removed the fancy caches and logic that help avoid stalls. 32

But we have LOTS of independent fragments. Idea #3: Interleave processing of many fragments on a single core to avoid stalls caused by high latency operations. 33

Hiding shader stalls. Time (clocks); Frag 1-8, Frag 9-16, Frag 17-24, Frag 25-32 (groups 1-4 on one core's Fetch/Decode and ALUs 1-8): instead of one set of contexts, the core holds contexts for four groups of fragments, and whenever the running group stalls it switches to another runnable group. 37

Throughput! Time (clocks); Frag 1-8, Frag 9-16, Frag 17-24, Frag 25-32 (groups 1-4): each group starts, stalls, becomes runnable again, and eventually finishes (Done!). We increase the run time of one group to increase the throughput of many groups. 38
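This is exactly what a CUDA programmer leans on when sizing a launch: put far more threads (warps) on each core than it has ALUs, so there is always a runnable group when another stalls on memory. A minimal sketch, with illustrative grid and block sizes:

#include <cuda_runtime.h>

// Grid-stride copy: each thread walks the array in steps of the total thread
// count, and the launch below puts many warps on every core so that loads
// which stall for hundreds of cycles are overlapped with other warps' work.
__global__ void copyKernel(const float *in, float *out, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = in[i];
}

// Illustrative launch (sizes are an assumption, not from the slides):
// copyKernel<<<1024, 256>>>(d_in, d_out, n);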

Storing contexts. Fetch/Decode; ALU 1-8; a pool of context storage (128 KB). 39

Eighteen small contexts (maximal latency hiding). 40

Twelve medium contexts. 41

Four large contexts (low latency-hiding ability). 42
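In CUDA terms, the size of each context corresponds to the registers (and shared memory) a thread uses: fat threads mean few resident warps and little latency hiding, which is the large-context case above. __launch_bounds__ is the real mechanism for trading this off; the kernel body here is a hypothetical placeholder.

#include <cuda_runtime.h>

// Ask the compiler to limit register use so that 256-thread blocks can be
// resident at least 4 per multiprocessor: more, smaller contexts, and
// therefore more warps available to hide memory latency.
__global__ void __launch_bounds__(256, 4)
scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Dropping the bound lets the compiler spend more registers per thread
// (bigger but fewer contexts), which can help compute-bound kernels and hurt
// memory-bound ones.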

Clarification: interleaving between contexts can be managed by hardware or software (or both!). NVIDIA / ATI Radeon GPUs: HW schedules and manages all contexts (lots of them); special on-chip storage holds fragment state. Intel Larrabee: HW manages four x86 (big) contexts at fine granularity; SW scheduling interleaves many groups of fragments on each HW context; the L1-L2 cache holds fragment state (as determined by SW). 43

Our chip: 16 cores, 8 mul-add ALUs per core (128 total), 16 simultaneous instruction streams, 64 concurrent (but interleaved) instruction streams, 512 concurrent fragments = 256 GFLOPS (@ 1 GHz). 44

Our enthusiast chip 32 cores, 16 ALUs per core (512 total) = 1 TFLOP (@ 1 GHz) 45
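A quick back-of-the-envelope check of both peak numbers, counting a mul-add as two floating-point operations (the convention implied by the slide):

\[ 128\ \text{ALUs} \times 2\ \tfrac{\text{FLOPs}}{\text{clock}} \times 1\ \text{GHz} = 256\ \text{GFLOPS}, \qquad 512\ \text{ALUs} \times 2\ \tfrac{\text{FLOPs}}{\text{clock}} \times 1\ \text{GHz} \approx 1\ \text{TFLOP}. \]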

Summary: three key ideas. 1. Use many slimmed-down cores to run in parallel. 2. Pack cores full of ALUs (by sharing an instruction stream across groups of fragments); option 1: explicit SIMD vector instructions; option 2: implicit sharing managed by hardware. 3. Avoid latency stalls by interleaving the execution of many groups of fragments: when one group stalls, work on another group. 46

Part 2: Putting the three ideas into practice: a closer look at real GPUs, the NVIDIA GeForce GTX 480 and the ATI Radeon HD 5870. 47

Disclaimer: the following slides describe a reasonable way to think about the architecture of commercial GPUs; many factors play a role in actual chip performance. 48

NVIDIA GeForce GTX 480 (Fermi). NVIDIA-speak: 480 stream processors ("CUDA cores"), SIMT execution. Generic speak: 15 cores, 2 groups of 16 SIMD functional units per core. 49

NVIDIA GeForce GTX 480 core. Fetch/Decode; execution contexts (128 KB); shared memory (16+48 KB). Each SIMD function unit executes 1 MUL-ADD per clock, with control shared across 16 units; the core contains 32 functional units, and two groups are selected each clock (the core decodes, fetches, and executes two instruction streams in parallel). Groups of 32 [fragments / vertices / CUDA threads] share an instruction stream; up to 48 groups are simultaneously interleaved, so up to 1536 individual contexts can be stored. Source: Fermi Compute Architecture Whitepaper; CUDA Programming Guide 3.1, Appendix G. 51

NVIDIA GeForce GTX 480 SM (the same core in NVIDIA-speak). Each CUDA core executes 1 MUL-ADD per clock; the SM contains 32 CUDA cores. Two warps are selected each clock (decode, fetch, and execute two warps in parallel), and up to 48 warps are interleaved, totaling 1536 CUDA threads. Source: Fermi Compute Architecture Whitepaper; CUDA Programming Guide 3.1, Appendix G. 52

NVIDIA GeForce GTX 480. There are 15 of these SMs on the GTX 480: that's 23,000 fragments! Or 23,000 CUDA threads! 53
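The 23,000 figure follows directly from the per-SM numbers on the previous slide:

\[ 32\ \tfrac{\text{threads}}{\text{warp}} \times 48\ \tfrac{\text{warps}}{\text{SM}} = 1536\ \tfrac{\text{threads}}{\text{SM}}, \qquad 15\ \text{SMs} \times 1536 = 23{,}040 \approx 23{,}000. \]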

ATI Radeon HD 5870 (Cypress). AMD-speak: 1600 stream processors. Generic speak: 20 cores, 16 beefy SIMD functional units per core, 5 multiply-adds per functional unit (VLIW processing). 54

ATI Radeon HD 5870 core. Fetch/Decode; execution contexts (256 KB); shared memory (32 KB). Each SIMD function unit executes up to 5 MUL-ADDs per clock, with control shared across 16 units. Groups of 64 [fragments / vertices / etc.] share an instruction stream; it takes four clocks to execute an instruction for all fragments in a group. Source: ATI Radeon HD5000 Series: An Inside View (HPG 2010). 55

ATI Radeon HD 5870 SIMD-engine (the same core in AMD-speak). Each stream processor executes up to 5 MUL-ADDs per clock, with control shared across 16 units. Groups of 64 [fragments / vertices / OpenCL work items] form a wavefront; it takes four clocks to execute an instruction for an entire wavefront. Source: ATI Radeon HD5000 Series: An Inside View (HPG 2010). 56

ATI Radeon HD 5870. There are 20 of these cores on the 5870: that's about 31,000 fragments! 57

The talk thus far: processing data Part 3: moving data to processors 58

Recall: CPU-style core OOO exec logic Branch predictor Fetch/Decode ALU Data cache (a big one) Execution Context 59

CPU-style memory hierarchy. OOO exec logic, branch predictor, Fetch/Decode, ALU, execution contexts; L1 cache (32 KB), L2 cache (256 KB), L3 cache (8 MB, shared across cores); 25 GB/sec to memory. CPU cores run efficiently when data is resident in cache (caches reduce latency and provide high bandwidth). 60

Throughput core (GPU-style). Fetch/Decode; ALU 1-8; execution contexts (128 KB); 150 GB/sec to memory. More ALUs, no large traditional cache hierarchy: we need a high-bandwidth connection to memory. 61

Bandwidth is a critical resource. A high-end GPU (e.g. Radeon HD 5870) has over twenty times (2.7 TFLOPS) the compute performance of a quad-core CPU, but no large cache hierarchy to absorb memory requests. The GPU memory system is designed for throughput: a wide bus (150 GB/sec), with memory requests repacked/reordered/interleaved to maximize use of the memory bus. Still, this is only six times the bandwidth available to the CPU. 62

Bandwidth thought experiment. Task: element-wise compute D[i] = A[i] × B[i] + C[i] over long vectors. For each element: 1. load input A[i]; 2. load input B[i]; 3. load input C[i]; 4. compute A[i] × B[i] + C[i]; 5. store the result into D[i]. That is four memory operations (16 bytes) for every MUL-ADD. The Radeon HD 5870 can do 1600 MUL-ADDs per clock, so we would need ~20 TB/sec of bandwidth to keep the functional units busy. Less than 1% efficiency, but still 6x faster than the CPU! 63
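Where the ~20 TB/sec comes from (assuming the HD 5870's roughly 850 MHz core clock, which is consistent with the 2.7 TFLOPS figure quoted earlier but is not stated on the slide):

\[ 1600\ \tfrac{\text{mul-adds}}{\text{clock}} \times 16\ \tfrac{\text{bytes}}{\text{mul-add}} \times 0.85\ \text{GHz} \approx 21.8\ \text{TB/sec}, \]

and the ~150 GB/sec the memory system actually delivers is less than 1% of that.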

Bandwidth limited! If processors request data at too high a rate, the memory system cannot keep up. No amount of latency hiding helps this. Overcoming bandwidth limits is a common challenge for GPU-compute application developers. 64

Reducing bandwidth requirements. Request data less often (instead, do more math): arithmetic intensity. Fetch data from memory less often (share/reuse data across fragments): on-chip communication or storage. 65

Reducing bandwidth requirements. Two examples of on-chip storage: texture caches and OpenCL local memory (CUDA shared memory). Texture caches capture reuse of texture data across nearby fragments, not temporal reuse within a single shader program. 66
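A minimal CUDA sketch of the second kind of on-chip storage (illustrative names and sizes, not the talk's code): each block stages its slice of the input in shared memory, so every element is read from DRAM once and then reused by up to three threads for a 3-point average.

#include <cuda_runtime.h>

#define BLOCK 256

__global__ void avg3(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + 2];                 // block's slice plus halo
    int g = blockIdx.x * blockDim.x + threadIdx.x;    // global element index
    int l = threadIdx.x + 1;                          // local index (after halo)

    tile[l] = (g < n) ? in[g] : 0.0f;                 // one DRAM read per element
    if (threadIdx.x == 0)
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;         // left halo (zero-padded)
    if (threadIdx.x == blockDim.x - 1)
        tile[BLOCK + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;  // right halo
    __syncthreads();                                  // tile fully populated

    if (g < n)
        out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}

// Illustrative launch: avg3<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);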

Modern GPU memory hierarchy. Fetch/Decode; ALU 1-8; execution contexts (128 KB); texture cache (read-only); shared local storage or L1 cache (64 KB); L2 cache (~1 MB); memory. On-chip storage takes load off the memory system, and many developers are calling for more cache-like storage (particularly for GPU-compute applications). 67

Summary 68

Think of a GPU as a heterogeneous multi-core processor optimized for maximum aggregate throughput. (currently at the extreme end of the design space) 69

Think about how various programming models map to these architectural characteristics. 70

An efficient GPU workload: has thousands of independent pieces of work (uses many ALUs on many cores, supports massive interleaving for latency hiding); is amenable to instruction stream sharing (maps to SIMD execution well); and is compute-heavy: the ratio of math operations to memory accesses is high, so it is not limited by memory bandwidth. 71
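One rough way to quantify the compute-heavy requirement, using the HD 5870 numbers from Part 3 (a ratio computed here, not taken from the slides): to be ALU-bound rather than bandwidth-bound, a kernel needs roughly

\[ \frac{2.7\ \text{TFLOPS}}{150\ \text{GB/sec}} \approx 18\ \tfrac{\text{floating-point ops}}{\text{byte of DRAM traffic}}. \]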

Thank you. Original slides: Kayvon Fatahalian. Contributors: Kurt Akeley, Solomon Boulos, Mike Doggett, Pat Hanrahan, Mike Houston, Jeremy Sugerman. 72