CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

2 Reading Assignment #5 (until March 12)
Read (required):
- Programming Massively Parallel Processors book, Chapter 3 (Introduction to CUDA)
- Programming Massively Parallel Processors book, Chapter 4 (CUDA Threads), up to and including Section 4.3
Read (optional):
- NVIDIA Fermi graphics (GF100) and compute white papers
- NVIDIA Kepler (GK110) white papers
- NVIDIA Maxwell (GM107) white paper: Ti-Whitepaper.pdf

3 From Shader Code to a Teraflop: How Shader Cores Work
Kayvon Fatahalian, Stanford University
SIGGRAPH 2009: Beyond Programmable Shading

5 My chip!
- 16 cores
- 8 mul-add ALUs per core (128 total)
- 16 simultaneous instruction streams
- 64 concurrent (but interleaved) instruction streams
- 512 concurrent fragments
= 256 GFLOPS (@ 1 GHz)

6 My enthusiast chip!
- 32 cores, 16 ALUs per core (512 total) = 1 TFLOP (@ 1 GHz)
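As a rough check of the totals on the two chip slides above, the sketch below recomputes them. It assumes that a mul-add counts as 2 floating-point operations per clock and that each instruction stream covers one fragment per ALU; the function names are illustrative and do not come from the talk.

```python
def peak_gflops(cores, alus_per_core, clock_ghz, flops_per_alu=2):
    """Peak throughput in GFLOPS.

    Assumption: every ALU retires one mul-add per clock,
    counted as 2 floating-point operations.
    """
    return cores * alus_per_core * flops_per_alu * clock_ghz


def concurrent_fragments(instruction_streams, alus_per_core):
    """Fragments in flight, assuming one fragment per ALU per stream."""
    return instruction_streams * alus_per_core


# "My chip": 16 cores, 8 ALUs per core, 64 interleaved streams, 1 GHz
print(peak_gflops(16, 8, 1.0))       # 256.0
print(concurrent_fragments(64, 8))   # 512

# "My enthusiast chip": 32 cores, 16 ALUs per core, 1 GHz
print(peak_gflops(32, 16, 1.0))      # 1024.0, i.e. about 1 TFLOP
```

The 64 interleaved streams correspond to 4 streams per core on 16 cores; interleaving raises the number of fragments in flight but not the peak rate, which depends only on the ALU count and the clock.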

18 NVIDIA G80/GT200 Architecture
[Figure: block diagram labeling the Streaming Processor (SP), Streaming Multiprocessor (SM), and Texture/Processing Cluster (TPC). Courtesy AnandTech]

19 NVIDIA G80/GT200 Architecture
- G80/G92: 8 TPCs * (2 SMs * 8 SPs) = 128 SPs
- GT200: 10 TPCs * (3 SMs * 8 SPs) = 240 SPs
- Arithmetic intensity has increased (ALUs vs. texture units)
[Figure: G80/G92 TPC and GT200 TPC side by side. Courtesy AnandTech]
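The SP totals above follow directly from the hierarchy TPC > SM > SP. A minimal sketch of that arithmetic, with illustrative function names that are not NVIDIA terminology:

```python
def sps_per_tpc(sms_per_tpc, sps_per_sm=8):
    """Streaming processors in one Texture/Processing Cluster."""
    return sms_per_tpc * sps_per_sm


def total_sps(tpcs, sms_per_tpc, sps_per_sm=8):
    """Streaming processors on the whole chip."""
    return tpcs * sps_per_tpc(sms_per_tpc, sps_per_sm)


print(total_sps(8, 2))    # G80/G92: 128
print(total_sps(10, 3))   # GT200:   240

# SPs per TPC grew from 16 to 24, a factor of 1.5
print(sps_per_tpc(3) / sps_per_tpc(2))  # 1.5
```

If the texture hardware per TPC is assumed unchanged between the two generations (an assumption, not stated on the slide), the 1.5x growth in SPs per TPC is one way to read the remark that the ratio of ALUs to texture units has increased.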

20 NVIDIA GT200 GPGPU Hardware
NVIDIA Tesla 10-series
- Based on GT200 architecture
- 1 Teraflop / device
- 4GB RAM / device (Tesla C1060)
- Multiple devices per node / machine (Tesla S1070)

21 NVIDIA Fermi / GF100 Hardware
GeForce GTX 580
- 512 CUDA cores (16 SMs)
- 1.5 GB memory
Tesla 20-series
- Cards: M2070/C2070,...
- Blades: S2050/S2070
- 3GB or 6GB / GPU, ECC memory

22 NVIDIA Fermi / GF100 Features
- Names
  - Compute: Fermi; product: Tesla 20-series
  - Graphics: GF100 (product: GeForce GTX 480, 580,...)
  - Compute capability 2.1 / 2.0; PTX ISA 3.0 / 2.x (html/c/doc/ptx_isa_3.0.pdf, toolkit/docs/ptx_isa_2.0.pdf)
- L1 and L2 caches
- More CUDA cores (up to 512)
- Faster double precision float performance, faster atomics, float atomics
- DirectX 11 and OpenGL 4 functionality
  - New shader types, scatter writes to images,...

23 NVIDIA Fermi / GF100 Stats

24 Streaming Multiprocessor
- Streaming processors are now CUDA cores
- 32 CUDA cores per Fermi streaming multiprocessor (SM)
- 16 SMs = 512 CUDA cores
- CPU-like cache hierarchy
  - L1 cache / shared memory
  - L2 cache
- Texture units and caches are now in the SM (in GT200 they belonged to the TPC, which spans multiple SMs)

25 Dual Warp Schedulers

26 Graphics Processing Clusters (GPC) (instead of TPC on GT200)
- 4 Streaming Multiprocessors (SMs) per GPC
  - 32 CUDA cores / SM
  - 4 SMs / GPC = 128 cores / GPC
- Decentralized rasterization and geometry
  - 4 raster engines
  - 16 PolyMorph engines

27 NVIDIA Fermi / GF100 Structure
Full size
- 4 GPCs, 4 SMs each
- 6 memory controllers, 64 bits each (= 384 bit)
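The full-size GF100 figures quoted on the preceding slides can be cross-checked with a few multiplications. This is only a sketch of that arithmetic; the function names are made up for illustration.

```python
def cuda_cores(gpcs, sms_per_gpc, cores_per_sm):
    """Total CUDA cores for a GPC > SM > core hierarchy."""
    return gpcs * sms_per_gpc * cores_per_sm


def bus_width_bits(controllers, bits_per_controller):
    """Aggregate memory interface width."""
    return controllers * bits_per_controller


# Full-size GF100: 4 GPCs, 4 SMs per GPC, 32 CUDA cores per SM
print(cuda_cores(1, 4, 32))    # 128 cores per GPC
print(cuda_cores(4, 4, 32))    # 512 cores in total (16 SMs)

# 6 memory controllers of 64 bits each
print(bus_width_bits(6, 64))   # 384
```

The same 16 SMs also account for the 16 PolyMorph engines (one per SM), while the 4 raster engines match the 4 GPCs.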

28 NVIDIA Fermi / GF100 Die
Full size
- 4 GPCs, 4 SMs each

29 Compute Capability 2.0
- 1024 threads / block
- More threads / SM
- 32K registers / SM
- New synchronization functions

30 L1 Cache vs. Shared Memory
- Two different configs
- 64KB total
  - 16KB shared, 48KB L1 cache
  - 48KB shared, 16KB L1 cache
- Set per kernel
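To make the per-kernel trade-off concrete, the sketch below picks between the two splits listed above based on how much shared memory a thread block needs. `pick_config` is a hypothetical helper written for this note, not a CUDA API; in CUDA itself the preference is requested per kernel with `cudaFuncSetCacheConfig`.

```python
KB = 1024

# The two Fermi splits of the 64 KB on-chip memory: (shared, L1),
# ordered from the largest L1 cache to the smallest.
CONFIGS = [(16 * KB, 48 * KB), (48 * KB, 16 * KB)]


def pick_config(shared_bytes_per_block):
    """Return the (shared, L1) split with the largest L1 cache
    that still fits one block's shared memory (hypothetical policy)."""
    for shared, l1 in CONFIGS:
        if shared_bytes_per_block <= shared:
            return shared, l1
    raise ValueError("block needs more shared memory than either split provides")


print(pick_config(8 * KB))    # (16384, 49152): small shared use, keep a large L1
print(pick_config(32 * KB))   # (49152, 16384): shared-memory-heavy kernel
```

Preferring the larger L1 whenever the shared memory fits is only one possible policy; a kernel that runs several blocks per SM may want the 48KB shared split even if a single block needs less than 16KB.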

31 Global Memory Access
- Cached on Fermi
  - L1 cache per SM
  - Global L2 cache
- Compile time flag can choose
  - Caching in both L1 and L2
  - Caching only in L2
- Cache line size (L1, L2): 128 bytes
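The 128-byte cache line explains why the access pattern of a warp matters. The sketch below counts how many distinct cache lines one warp touches for a given stride. It assumes a warp of 32 threads and 4-byte elements, and it is a simplified model of address-to-line mapping only, not of the actual hardware transaction logic.

```python
CACHE_LINE_BYTES = 128   # L1 / L2 line size from the slide
WARP_SIZE = 32           # threads per warp on NVIDIA GPUs


def cache_lines_touched(base_addr, stride_elems, elem_bytes=4):
    """Number of distinct 128-byte lines accessed when thread i of a warp
    reads the element at base_addr + i * stride_elems * elem_bytes."""
    lines = set()
    for lane in range(WARP_SIZE):
        first = base_addr + lane * stride_elems * elem_bytes
        last = first + elem_bytes - 1
        # An element may straddle two lines if it is misaligned.
        lines.update(range(first // CACHE_LINE_BYTES,
                           last // CACHE_LINE_BYTES + 1))
    return len(lines)


print(cache_lines_touched(0, 1))    # 1: contiguous floats, aligned
print(cache_lines_touched(64, 1))   # 2: contiguous but misaligned by 64 bytes
print(cache_lines_touched(0, 2))    # 2: stride of two floats
print(cache_lines_touched(0, 32))   # 32: every thread hits its own line
```

A contiguous, aligned read of 32 floats fills exactly one line (32 * 4 = 128 bytes), which is the best case in this model; large strides fetch a full line per thread and waste most of the bandwidth.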

32 NVIDIA Kepler Architecture
Two different versions
- GK104, compute capability 3.0
  - GeForce GTX 680, Quadro K5000
  - Tesla K10 series
- GK110, compute capability 3.5
  - GeForce GTX Titan (just released!)
  - Tesla K20 series

33 GF100 Graphics Pipeline
Input Assembler -> Vertex Shader -> Hull Shader -> Tessellator -> Domain Shader -> Geometry Shader -> Stream Output -> Rasterizer -> Pixel Shader -> Output Merger

34 NVIDIA Kepler / GK104 Structure
Full size
- 4 GPCs, 2 SMXs each = 8 SMXs, 1536 CUDA cores

35 GK104 SMX
- 192 CUDA cores
- 32 LD/ST units
- 16 SFUs
- 16 texture units

36 NVIDIA Kepler / GK110 Structure
Full size
- 15 SMXs
- 2880 CUDA cores

37 GK110 SMX
- 192 CUDA cores
- 64 DP units
- 32 LD/ST units
- 16 SFUs
- 16 texture units
- New read-only data cache (48KB)
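The Kepler core counts on the preceding slides come from the same kind of multiplication as for Fermi, now with 192 CUDA cores per SMX. A small sketch of that arithmetic, using illustrative names:

```python
CORES_PER_SMX = 192      # single-precision CUDA cores per Kepler SMX
DP_UNITS_PER_SMX = 64    # double-precision units per GK110 SMX


def units_total(smx_count, units_per_smx):
    """Total number of a given unit type across all SMXs."""
    return smx_count * units_per_smx


# GK104: 4 GPCs with 2 SMXs each
gk104_smxs = 4 * 2
print(gk104_smxs)                                   # 8
print(units_total(gk104_smxs, CORES_PER_SMX))       # 1536

# GK110: 15 SMXs in the full-size chip
print(units_total(15, CORES_PER_SMX))               # 2880
print(units_total(15, DP_UNITS_PER_SMX))            # 960 DP units

# DP units per CUDA core on a GK110 SMX: one for every three cores
print(CORES_PER_SMX // DP_UNITS_PER_SMX)            # 3
```

Compared with the 32 cores of a Fermi SM, each SMX is six times wider, so GK104 reaches three times the core count of GF100 with half as many multiprocessors.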

38 Compute Capabilities

39 Maxwell vs. Kepler Architecture: GM107

40 Maxwell vs. Kepler Architecture: GK107 vs. GM107

41 Thank you.
