Administrivia. HW0 scores, HW1 peer-review assignments out. If you're having Cython trouble with HW2, let us know.


Maurice Byrd

1 Administrivia. HW0 scores, HW1 peer-review assignments out. HW2 out, due Nov. 2. If you're having Cython trouble with HW2, let us know. Review on Wednesday: post questions on Piazza.

2 Introduction to GPUs. With many slides from Kayvon Fatahalian.

3 Single Core CPU. Diagram: Fetch/Decode; ALU (Execute); Data cache (a big one); Execution Context; Out-of-order control logic; Fancy branch predictor; Memory pre-fetcher.

4 Most of this logic is to help serial programs run quickly. Diagram: Fetch/Decode; ALU (Execute); Data cache (a big one); Execution Context; Out-of-order control logic; Fancy branch predictor; Memory pre-fetcher.

5 How do we speed this up? Diagram: Fetch/Decode; ALU (Execute); Data cache (a big one); Execution Context; Out-of-order control logic; Fancy branch predictor; Memory pre-fetcher.

6 Multiple Cores. Each core (Core 1 ... Core N) has its own L1 cache (32 KB) and L2 cache (256 KB); the cores share an L3 cache (8 MB).

7 SIMD Extensions. [Figure: a core with one Fetch/Decode unit driving ALU 1 through ALU 8, with per-ALU contexts (Ctx) and Shared Ctx Data.]

8 Add lots of cache. Hides latency, lets multiple threads interleave. Each core (Core 1 ... Core N) has an L1 cache (32 KB) and an L2 cache (256 KB); a shared L3 cache (8 MB) connects at 25 GB/sec to DDR3 DRAM memory (gigabytes).

9 How does SIMD interact with control logic & cache? Diagram: Fetch/Decode; ALU (Execute); Data cache (a big one); Execution Context; Out-of-order control logic; Fancy branch predictor; Memory pre-fetcher.

10 GPU idea: throw out most of this. Kept: Fetch/Decode; ALU (Execute); Execution Context. Crossed out: Data cache (a big one); Out-of-order control logic; Fancy branch predictor; Memory pre-fetcher.

11 Core i7: 4 cores, 8 SIMD ALUs per core.

12 NVIDIA GTX 480: 15 cores, 32 SIMD ALUs per core, 1.3 TFLOPS.

13 GTX-480 in more detail. NVIDIA GTX 480 core: two Fetch/Decode units; Execution contexts (128 KB); Shared memory (16+48 KB). Each SIMD function unit has its control shared across 16 units (1 MUL-ADD per clock). This process occurs on another set of 16 ALUs as well, so there are 32 ALUs per core and 480 ALUs per chip.
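
The ALU counts on this slide can be tied back to the 1.3 TFLOPS figure with some quick arithmetic. The sketch below is only a back-of-the-envelope check: the clock rate is an assumption (it does not appear on the slides), and a MUL-ADD is counted as two floating-point operations.

```python
# Back-of-the-envelope peak throughput from the GTX 480 figures on the slides.
ALUS_PER_CORE = 32        # from the slide: two groups of 16 ALUs
ALUS_PER_CHIP = 480       # from the slide
FLOPS_PER_MULADD = 2      # one multiply plus one add per ALU per clock
SHADER_CLOCK_HZ = 1.4e9   # ASSUMPTION: roughly 1.4 GHz, not stated on the slides

cores = ALUS_PER_CHIP // ALUS_PER_CORE
peak_flops = ALUS_PER_CHIP * FLOPS_PER_MULADD * SHADER_CLOCK_HZ

print("cores:", cores)
print("peak TFLOPS:", round(peak_flops / 1e12, 2))
```

Under that assumed clock the result lands close to the quoted 1.3 TFLOPS.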

14 CPU vs. GPU memory hierarchies
CPU: big caches, few threads, modest memory BW. Rely mainly on caches and prefetching. Each core (Core 1 ... Core N) has an L1 cache (32 KB) and an L2 cache (256 KB); a shared L3 cache (8 MB) connects at 25 GB/sec to DDR3 DRAM memory (gigabytes).
GPU: small caches, many threads, huge memory BW. Rely mainly on multi-threading. Each core (Core 1 ... Core N) has execution contexts (128 KB), a GFX texture cache (12 KB), and a scratchpad / L1 cache (64 KB); a shared L2 cache (768 KB) connects at 177 GB/sec to GDDR5 DRAM memory (~1 GB).
CMU , Spring 201

15 More Threads. NVIDIA GTX 480 core: two Fetch/Decode units; Execution contexts (128 KB); Shared memory (16+48 KB). The 128 KB of context storage holds registers, program state, etc. Each core can keep as many threads resident as it has room to hold contexts for. Fast switching between threads.
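
To make "as many threads as it can hold their contexts" concrete, here is a minimal sketch of the division involved. The per-thread context sizes are illustrative assumptions, and the helper name `max_resident_threads` is hypothetical; real hardware also imposes other limits that this ignores.

```python
CONTEXT_STORAGE_BYTES = 128 * 1024   # from the slide: 128 KB of contexts per core

def max_resident_threads(bytes_per_thread):
    """Number of thread contexts that fit in one core's context storage."""
    return CONTEXT_STORAGE_BYTES // bytes_per_thread

# ASSUMPTION: a context is just N four-byte registers; N values are illustrative.
for registers in (16, 32, 64):
    print(registers, "registers ->", max_resident_threads(registers * 4), "threads")
```

Smaller contexts mean more resident threads, which is the trade-off the next two slides illustrate.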

16 Many small contexts: good latency hiding. [Figure: one Fetch/Decode unit driving ALU 1 through ALU 8.]

17 Few large contexts: poor latency hiding. [Figure: one Fetch/Decode unit driving ALU 1 through ALU 8.]
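
One way to see why the number of contexts matters is a simple utilization model. This is a sketch under stated assumptions, not a description of the real scheduler: every thread alternates a fixed burst of computation with a fixed memory stall, and switching threads is free. The function name and the cycle counts are hypothetical.

```python
def alu_utilization(num_contexts, work_cycles, stall_cycles):
    """Fraction of cycles the ALUs are busy when each resident thread
    alternates work_cycles of computation with stall_cycles of waiting."""
    busy = num_contexts * work_cycles      # work available per period
    period = work_cycles + stall_cycles    # one thread's full cycle
    return min(1.0, busy / period)

# ASSUMPTION: 10 cycles of work followed by a 90-cycle memory stall.
for contexts in (1, 4, 10, 20):
    print(contexts, "contexts ->", alu_utilization(contexts, 10, 90))
```

In this model the ALUs stay fully busy only once there are enough other threads to cover the whole stall.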

18 Sounds great, what's the catch?

19 Sounds great, what's the catch? Every instruction is SIMD. All ALUs are doing exactly the same thing in lockstep.

20 [Figure: time (clocks) runs down the page; ALU 1, ALU 2, ..., ALU 8 run across it. Every ALU executes the same instruction at each step.]
    a = b + c
    d = e * a
    result = 3 * d
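
The lockstep picture can be mimicked in plain Python by applying each instruction to all eight lanes before moving to the next one. This is only an illustrative sketch: the function name and the input values are made up, and real hardware does this in a single SIMD instruction rather than a loop.

```python
def simd_execute(b, c, e):
    """Run the slide's three-instruction program across all lanes,
    one instruction at a time (every lane does the same operation)."""
    lanes = range(len(b))
    a = [b[i] + c[i] for i in lanes]      # step 1: every lane adds
    d = [e[i] * a[i] for i in lanes]      # step 2: every lane multiplies
    result = [3 * d[i] for i in lanes]    # step 3: every lane scales by 3
    return result

# ASSUMPTION: illustrative inputs, one element per ALU.
b = [1, 2, 3, 4, 5, 6, 7, 8]
c = [1] * 8
e = [2] * 8
print(simd_execute(b, c, e))
```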


24 What about Branches? [Figure: time (clocks) runs down the page; ALU 1, ALU 2, ..., ALU 8 run across it.]
    if x > 0:
        tmp = x ** 5.0
    else:
        tmp = 2 * tmp
    result = tmp + 1

25 What about Branches? Same code; branch outcome per ALU (ALU 1 through ALU 8): T T F T F F F F.

26 What about Branches? Same code and branch outcomes. Not all ALUs do useful work! Worst case: 1/8 peak performance.

27 What about Branches? Same code and branch outcomes. Back to peak performance once the branch is over.
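
A rough way to quantify "not all ALUs do useful work" is to count useful ALU slots against total slots when both sides of the branch have to be stepped through. The sketch below assumes a simple masking model in which a side is skipped only if no lane needs it; the function name and the per-side costs are hypothetical.

```python
def branch_utilization(mask, if_cost, else_cost):
    """Fraction of ALU slots doing useful work for one divergent branch.
    mask[i] is True if lane i takes the 'if' side; costs are in clocks."""
    lanes = len(mask)
    taken = sum(mask)
    not_taken = lanes - taken
    slots = 0
    useful = 0
    if taken:                      # some lane needs the 'if' side
        slots += lanes * if_cost   # all lanes are occupied while it runs
        useful += taken * if_cost
    if not_taken:                  # some lane needs the 'else' side
        slots += lanes * else_cost
        useful += not_taken * else_cost
    return useful / slots

slide_mask = [True, True, False, True, False, False, False, False]  # T T F T F F F F
print(branch_utilization(slide_mask, 1, 1))                 # equal-cost sides
print(branch_utilization([True] * 8, 5, 1))                 # no divergence
print(branch_utilization([True] + [False] * 7, 1000, 1))    # one lane on a costly side
```

With one lane on a very expensive side and the other seven idle, the ratio approaches the 1/8 worst case quoted on the slide.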

28 Caches. NVIDIA GTX 480 core: two Fetch/Decode units; Execution contexts (128 KB); Shared memory (16+48 KB). Instead of large caches, hide latency with many more threads and high memory bandwidth.

29 CPU vs. GPU memory hierarchies
CPU: big caches, few threads, modest memory BW. Rely mainly on caches and prefetching. Each core (Core 1 ... Core N) has an L1 cache (32 KB) and an L2 cache (256 KB); a shared L3 cache (8 MB) connects at 25 GB/sec to DDR3 DRAM memory (gigabytes).
GPU: small caches, many threads, huge memory BW. Rely mainly on multi-threading. Each core (Core 1 ... Core N) has execution contexts (128 KB), a GFX texture cache (12 KB), and a scratchpad / L1 cache (64 KB); a shared L2 cache (768 KB) connects at 177 GB/sec to GDDR5 DRAM memory (~1 GB).

30 GPUs - Summary. Many, many Arithmetic Logic Units (ALUs). Many threads per core (efficiency & latency hiding). High memory bandwidth (for bandwidth-bound applications). Every instruction is SIMD within a core. Memory bandwidth has to be managed for peak performance (more on this later).
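
To see what "bandwidth-bound" means with the numbers from these slides, the sketch below compares peak arithmetic throughput with what memory can feed. It is a simplified upper-bound model, and the vector-add kernel is an illustrative assumption, not an example from the lecture.

```python
PEAK_FLOPS = 1.3e12        # from the slides: 1.3 TFLOPS
PEAK_BANDWIDTH = 177e9     # from the slides: 177 GB/sec

def attainable_flops(flops_per_byte):
    """Upper bound on throughput for a kernel that performs flops_per_byte
    floating-point operations per byte moved to or from DRAM."""
    return min(PEAK_FLOPS, PEAK_BANDWIDTH * flops_per_byte)

# Flops per byte needed before arithmetic, not memory, becomes the limit.
balance_point = PEAK_FLOPS / PEAK_BANDWIDTH

# ASSUMPTION: element-wise a = b + c on 4-byte floats moves 12 bytes
# (two loads, one store) for every addition.
vector_add_intensity = 1 / 12

print("balance point (flops/byte):", round(balance_point, 2))
print("vector add bound (GFLOPS):", attainable_flops(vector_add_intensity) / 1e9)
```

A kernel well below the balance point is limited by memory traffic, however many ALUs the chip has.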

31 Incoming. Wednesday: Review (post questions!). Friday: Intro to Odyssey (and OpenCL).

32 [Figure: execution timeline. Several groups of elements (Elements 1-8, Elements 9-16, and further groups) alternate between Stall and Runnable over time until each is Done.]
