From Shader Code to a Teraflop: How Shader Cores Work

From Shader Code to a Teraflop: How Shader Cores Work Kayvon Fatahalian Stanford University

This talk: 1. Three major ideas that make GPU processing cores run fast. 2. A closer look at real GPU designs: NVIDIA GTX 285, AMD Radeon 4890, Intel Larrabee. 3. Memory hierarchy: moving data to processors. 2

Part 1: throughput processing. Three key concepts behind how modern GPU processing cores run code. Knowing these concepts will help you: 1. Understand the space of GPU core (and throughput CPU processing core) designs. 2. Optimize shaders/compute kernels. 3. Establish intuition: what workloads might benefit from the design of these architectures? 3

What's in a GPU? [Block diagram: eight shader cores alongside fixed-function units: texture units (Tex), input assembly, rasterizer, output blend, video decode, and a work distributor (HW or SW?).] A heterogeneous chip multi-processor, highly tuned for graphics. 4

A diffuse reflectance shader:

    sampler mySamp;
    Texture2D<float3> myTex;
    float3 lightDir;

    float4 diffuseShader(float3 norm, float2 uv) {
        float3 kd;
        kd = myTex.Sample(mySamp, uv);
        kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
        return float4(kd, 1.0);
    }

The work for each fragment is independent, but there is no explicit parallelism. 5

Compile shader: one unshaded fragment comes in as the input record, and the diffuseShader source above compiles to:

    <diffuseShader>:
    sample r0, v4, t0, s0
    mul    r3, v0, cb0[0]
    madd   r3, v1, cb0[1], r3
    madd   r3, v2, cb0[2], r3
    clmp   r3, r3, l(0.0), l(1.0)
    mul    o0, r0, r3
    mul    o1, r1, r3
    mul    o2, r2, r3
    mov    o3, l(1.0)

One shaded fragment goes out as the output record. 6

Execute shader: the core's Fetch/Decode unit, ALU (Execute), and Execution Context step through the compiled <diffuseShader> program one instruction at a time. 7

CPU-style cores: Fetch/Decode, ALU (Execute), and Execution Context, plus out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a data cache (a big one). 13

Slimming down: keep only the Fetch/Decode, ALU (Execute), and Execution Context. Idea #1: remove the components that help a single instruction stream run fast. 14

Two cores (two fragments in parallel): each core has its own Fetch/Decode, ALU (Execute), and Execution Context, and each runs its own copy of the <diffuseShader> instruction stream on its own fragment. 15

Four cores (four fragments in parallel), each with its own Fetch/Decode, ALU (Execute), and Execution Context. 16

Sixteen cores (sixteen fragments in parallel): 16 cores = 16 simultaneous instruction streams. 17

Instruction stream sharing: but many fragments should be able to share an instruction stream, since they all run the same <diffuseShader> program! 18

Recall: the simple processing core: Fetch/Decode, ALU (Execute), Execution Context. 19

Add ALUs: one Fetch/Decode unit now feeds ALU 1 through ALU 8, each with its own context (Ctx) plus shared Ctx data. Idea #2: amortize the cost/complexity of managing an instruction stream across many ALUs: SIMD processing. 20

Modifying the shader: the original compiled <diffuseShader> processes one fragment using scalar ops on scalar registers. 21

Modifying the shader:

    <VEC8_diffuseShader>:
    VEC8_sample vec_r0, vec_v4, t0, vec_s0
    VEC8_mul    vec_r3, vec_v0, cb0[0]
    VEC8_madd   vec_r3, vec_v1, cb0[1], vec_r3
    VEC8_madd   vec_r3, vec_v2, cb0[2], vec_r3
    VEC8_clmp   vec_r3, vec_r3, l(0.0), l(1.0)
    VEC8_mul    vec_o0, vec_r0, vec_r3
    VEC8_mul    vec_o1, vec_r1, vec_r3
    VEC8_mul    vec_o2, vec_r2, vec_r3
    VEC8_mov    vec_o3, l(1.0)

The new compiled shader processes 8 fragments using vector ops on vector registers. 22

Modifying the shader: fragments 1 through 8 are mapped across the eight ALUs, all driven by the single <VEC8_diffuseShader> instruction stream. 23
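
As a rough illustration (not the compiler's actual output), the effect of a VEC8 instruction can be sketched in plain C++; the Vec8 type and vec8_mul helper are hypothetical names used only for this sketch:

    // Hypothetical 8-wide "register": one value per ALU lane.
    struct Vec8 { float lane[8]; };

    // What a single VEC8_mul accomplishes: the fetch/decode cost is paid once,
    // and all eight lanes apply the same operation to their own data.
    Vec8 vec8_mul(const Vec8& a, const Vec8& b) {
        Vec8 r;
        for (int i = 0; i < 8; ++i) r.lane[i] = a.lane[i] * b.lane[i];
        return r;
    }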

128 fragments in parallel 16 cores = 128 ALUs = 16 simultaneous instruction streams 24

128 [vertices / primitives / fragments / CUDA threads / OpenCL work items / compute shader threads] in parallel. 25

But what about branches? Time (clocks) runs down the page, and ALU 1 through ALU 8 each hold one fragment. All eight lanes execute the unconditional shader code together, then evaluate the branch:

    <unconditional shader code>
    if (x > 0) {
        y = pow(x, exp);
        y *= Ks;
        refl = y + Ka;
    } else {
        x = 0;
        refl = Ka;
    }
    <resume unconditional shader code>

With a branch mask of T T F T F F F F, the lanes that took the T path execute the first block while the F lanes idle, and then the F lanes execute the else block while the T lanes idle. Not all ALUs do useful work! Worst case: 1/8 performance. 26
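
For readers coming at this from CUDA, here is a hedged sketch of the same divergence problem; the kernel and its names (reflectance, Ka, Ks, exponent) are illustrative, not from the original slides. Threads in one warp that disagree on the branch force the hardware to execute both paths, masking off the inactive lanes each time:

    // Illustrative CUDA kernel: lanes where x > 0 and lanes where x <= 0 take
    // different paths, so a warp executes both paths with some lanes idle.
    __global__ void reflectance(const float* x_in, float* refl,
                                float Ka, float Ks, float exponent, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float x = x_in[i];
        if (x > 0.0f) {                 // taken by some lanes of the warp
            float y = powf(x, exponent);
            y *= Ks;
            refl[i] = y + Ka;
        } else {                        // taken by the remaining lanes, serially
            refl[i] = Ka;
        }
    }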

Clarification: SIMD processing does not imply SIMD instructions. Option 1: explicit vector instructions (Intel/AMD x86 SSE, Intel Larrabee). Option 2: scalar instructions with implicit HW vectorization: the HW determines instruction stream sharing across ALUs, and the amount of sharing is hidden from software (NVIDIA GeForce "SIMT" warps, AMD Radeon architectures). In practice: 16 to 64 fragments share an instruction stream. 30
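
Option 2 is what CUDA exposes: each thread is written as ordinary scalar code, and the hardware maps groups of threads (32 per warp on NVIDIA parts) onto the SIMD lanes. A minimal sketch, with illustrative names:

    // Scalar per-thread code; no vector types are visible to the programmer.
    // The hardware runs 32 such threads in lock-step as one warp.
    __global__ void scale(const float* in, float* out, float k, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = k * in[i];
    }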

Stalls! Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation. Texture access latency = 100s to 1000s of cycles. We've removed the fancy caches and logic that help avoid stalls. 31

But we have LOTS of independent fragments. Idea #3: Interleave processing of many fragments on a single core to avoid stalls caused by high latency operations. 32

Hiding shader stalls: time (clocks) runs down the page for Frag 1-8 on one core (Fetch/Decode, eight ALUs, per-group Ctx storage, shared Ctx data). 33

Hiding shader stalls: the context storage is split into four groups (1-4) holding Frag 1-8, Frag 9-16, Frag 17-24, and Frag 25-32. 34

Hiding shader stalls: when group 1 (Frag 1-8) stalls, the other groups are still runnable. 35

Hiding shader stalls: as each group stalls in turn, the core switches to another runnable group, so the ALUs stay busy. 37

Throughput! Each group starts, runs until it stalls, resumes when another group stalls, and eventually finishes (Done!). We increase the run time of any one group in order to maximize the throughput of many groups. 38
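
A rough back-of-the-envelope, with illustrative numbers rather than figures from the slides: if a texture fetch takes about 400 cycles and a group can issue roughly 20 cycles of independent ALU work before it needs the result, the core needs on the order of 400 / 20 = 20 runnable groups to keep its ALUs busy, which is the regime the "twenty small contexts" configuration below is aiming for.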

Storing contexts: the core has a pool of context storage (64 KB) shared among its groups. 39

Twenty small contexts (maximal latency hiding ability). 40

Twelve medium contexts. 41

Four large contexts (low latency hiding ability). 42
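
Rough arithmetic on the 64 KB pool, assuming it is divided evenly: 64 KB / 20 contexts ≈ 3.2 KB per context, 64 KB / 12 ≈ 5.3 KB, and 64 KB / 4 = 16 KB. The more register state a shader needs per group, the fewer contexts fit in the pool and the less latency the core can hide.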

Clarification: interleaving between contexts can be managed by HW or SW (or both!). NVIDIA / AMD Radeon GPUs: HW schedules and manages all contexts (lots of them), and special on-chip storage holds fragment state. Intel Larrabee: HW manages four x86 (big) contexts at fine granularity, SW scheduling interleaves many groups of fragments on each HW context, and the L1-L2 cache holds fragment state (as determined by SW). 43

My chip! 16 cores, 8 mul-add ALUs per core (128 total), 16 simultaneous instruction streams, 64 concurrent (but interleaved) instruction streams, 512 concurrent fragments = 256 GFLOPs (@ 1 GHz). 44

My enthusiast chip! 32 cores, 16 ALUs per core (512 total) = 1 TFLOP (@ 1 GHz) 45
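
The peak numbers follow directly from the ALU counts, counting a mul-add as 2 FLOPs: 16 cores x 8 ALUs x 2 FLOPs x 1 GHz = 256 GFLOPs, and 32 cores x 16 ALUs x 2 FLOPs x 1 GHz = 1,024 GFLOPs, roughly 1 TFLOP.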

Summary: three key ideas 1. Use many slimmed down cores to run in parallel 2. Pack cores full of ALUs (by sharing instruction stream across groups of fragments) Option 1: Explicit SIMD vector instructions Option 2: Implicit sharing managed by hardware 3. Avoid latency stalls by interleaving execution of many groups of fragments When one group stalls, work on another group 46

Part 2: Putting the three ideas into practice: A closer look at real GPUs NVIDIA GeForce GTX 285 AMD Radeon HD 4890 Intel Larrabee (as proposed) 47

Disclaimer The following slides describe how one can think about the architecture of NVIDIA, AMD, and Intel GPUs Many factors play a role in actual chip performance 48

NVIDIA GeForce GTX 285. NVIDIA-speak: 240 stream processors, "SIMT execution". Generic speak: 30 cores, 8 SIMD functional units per core. 49

NVIDIA GeForce GTX 285 core: 64 KB of storage for fragment contexts (registers). [Diagram legend: SIMD functional unit with control shared across 8 units; multiply-add; multiply; instruction stream decode; execution context storage.] 50

NVIDIA GeForce GTX 285 core: 64 KB of storage for fragment contexts (registers). Groups of 32 [fragments/vertices/threads/etc.] share an instruction stream (they are called "warps"). Up to 32 groups are simultaneously interleaved, so up to 1,024 fragment contexts can be stored. 51
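
Working out the numbers on this slide: 32 fragments per warp x 32 warps = 1,024 fragment contexts, and 64 KB / 1,024 contexts = 64 bytes, i.e. sixteen 32-bit registers per fragment at full occupancy. Shaders that need more registers per fragment reduce the number of warps that can be interleaved, and with them the core's ability to hide latency.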

NVIDIA GeForce GTX 285: there are 30 of these cores (plus texture units) on the GTX 285, for over 30,000 fragments in flight! 52

NVIDIA GeForce GTX 285 Generic speak: 30 processing cores 8 SIMD functional units per core Best case: 240 mul-adds + 240 muls per clock 53

AMD Radeon HD 4890. AMD-speak: 800 stream processors, HW-managed instruction stream sharing (like "SIMT"). Generic speak: 10 cores, 16 beefy SIMD functional units per core, 5 multiply-adds per functional unit. 54

AMD Radeon HD 4890 core. [Diagram legend: SIMD functional unit with control shared across 16 units; multiply-add; instruction stream decode; execution context storage.] 55

AMD Radeon HD 4890 core: groups of 64 [fragments/vertices/etc.] share an instruction stream (AMD doesn't have a fancy name like "warp"). One fragment is processed by each of the 16 SIMD units, repeated for four clocks. 56

AMD Radeon HD 4890. [Full-chip diagram: the cores plus texture units (Tex).] 57

AMD Radeon HD 4890 Generic speak: 10 processing cores 16 beefy SIMD functional units per core 5 multiply-adds per functional unit Best case: 800 multiply-adds per clock Scale of interleaving similar to NVIDIA GPUs 58
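
Working out the ALU count: 10 cores x 16 functional units x 5 multiply-adds = 800 "stream processors". At the HD 4890's roughly 850 MHz core clock, 800 mul-adds x 2 FLOPs x 0.85 GHz ≈ 1.36 TFLOPs peak.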

Intel Larrabee. Intel-speak: we won't say anything about core count or clock rate; explicit 16-wide vector ISA; each core interleaves four x86 instruction streams; software implements additional interleaving. Generic speak: that was the generic speak. 59

Intel Larrabee core. Each HW context: 32 vector registers, 32 KB of L1 cache, 256 KB of L2 cache. [Diagram legend: SIMD functional unit with control shared across 16 units; mul-add; instruction stream decode; execution context storage / HW registers.] 60

Intel Larrabee. [Full-chip diagram: an undisclosed number of cores (???) plus texture units (Tex).] 61

The talk thus far: processing data Part 3: moving data to processors 62

Recall: CPU-style core: OOO exec logic, branch predictor, Fetch/Decode, ALU, Execution Context, data cache (a big one). 63

CPU-style memory hierarchy: L1 cache (32 KB), L2 cache (256 KB), L3 cache (8 MB, shared across cores), and 25 GB/sec to memory. CPU cores run efficiently when data is resident in cache (caches reduce latency and provide high bandwidth). 64

Throughput core (GPU-style): Fetch/Decode, eight ALUs, execution contexts (64 KB), and 150 GB/sec to memory. More ALUs and no traditional cache hierarchy: it needs a high-bandwidth connection to memory. 65

Bandwidth is critical. A high-end GPU has 11x the compute performance of a high-end CPU but only 6x the bandwidth to feed it, and no complicated cache hierarchy. The GPU memory system is designed for throughput: a wide bus (150 GB/sec) and repacking/reordering/interleaving of memory requests to maximize use of the memory bus. 66

Bandwidth thought experiment: element-wise multiply two long vectors A and B: load input A[i], load input B[i], multiply, store result C[i]. That is 3 memory operations (12 bytes) every 4 cycles, which needs ~1 TB/sec of bandwidth on a high-end GPU, about 7x the available bandwidth. The result runs at roughly 15% efficiency, but is still 6x faster than a high-end CPU! 67
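
The thought experiment as CUDA code (a minimal sketch; the kernel name and launch configuration are illustrative): per element there are two 4-byte loads and one 4-byte store for a single multiply, so performance is bound by memory bandwidth rather than ALU throughput.

    // Each element moves 12 bytes across the memory bus for one multiply.
    __global__ void vec_mul(const float* A, const float* B, float* C, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) C[i] = A[i] * B[i];
    }
    // Example launch for n elements, 256 threads per block:
    //   vec_mul<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);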

Bandwidth limited! If processors request data at too high a rate, the memory system cannot keep up. No amount of latency hiding helps this. Overcoming bandwidth limits is a common challenge for GPU-compute application developers. 68

Reducing required bandwidth Request data less often (do more math) Share/reuse data across fragments (increase on-chip storage) 69

Reducing required bandwidth: two examples of on-chip storage are the texture cache and CUDA shared memory ("OpenCL local"). [Diagram: fragments 1-4 sharing texture data.] 70

GPU memory hierarchy: execution contexts (64 KB), a read-only texture cache, and shared local storage (16 KB), all backed by memory. On-chip storage takes load off the memory system; many developers are calling for larger, more cache-like storage. 71
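
A hedged CUDA sketch of the "share/reuse data across fragments" idea (the smooth kernel and TILE size are illustrative, and the kernel assumes it is launched with 256 threads per block): each input value is loaded from DRAM once into shared local storage, then read by up to three neighboring threads, cutting the required off-chip bandwidth.

    #define TILE 256  // must match the block size used at launch

    // 3-point average: stage inputs in on-chip shared memory ("OpenCL local")
    // so neighboring threads reuse each loaded value instead of re-reading DRAM.
    __global__ void smooth(const float* in, float* out, int n) {
        __shared__ float tile[TILE + 2];             // +2 for one halo element per side
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int t = threadIdx.x + 1;
        tile[t] = (i < n) ? in[i] : 0.0f;
        if (threadIdx.x == 0)
            tile[0] = (i > 0) ? in[i - 1] : 0.0f;
        if (threadIdx.x == blockDim.x - 1)
            tile[TILE + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;
        __syncthreads();                             // whole tile is now loaded
        if (i < n) out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
    }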

Summary 72

Think of a GPU as a multi-core processor optimized for maximum throughput. (currently at extreme end of design space) 73

An efficient GPU workload has thousands of independent pieces of work (it uses many ALUs on many cores and supports massive interleaving for latency hiding), is amenable to instruction stream sharing (it maps well to SIMD execution), and is compute-heavy: the ratio of math operations to memory accesses is high, so it is not limited by bandwidth. 74

Acknowledgements Kurt Akeley Solomon Boulos Mike Doggett Pat Hanrahan Mike Houston Jeremy Sugerman 75

Thank you http://gates381.blogspot.com 76