Parallel programming: Introduction to GPU architecture. Sylvain Collange Inria Rennes Bretagne Atlantique

Size: px

Start display at page:

Download "Parallel programming: Introduction to GPU architecture. Sylvain Collange Inria Rennes Bretagne Atlantique"

Hannah Johnston
6 years ago
Views:

1 Parallel programming: Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique

2 Outline of the course Feb 29: Introduction to GPU architecture Let's pretend we are a GPU architect Performance models March 7?: GPU programming The software side Programming model March 14: Performance optimization Possible bottlenecks Common optimization techniques 4 labs

computation on GPU (GPGPU) 2016: the 2 fastest supercomputers use GPUs/accelerators #1 Tianhe-2

3 GPGPU in HPC, today Graphics Processing Unit (GPU) Made for computer games: high-volume market Low unit price, amortized R&D Inexpensive, high-performance parallel processor 2002: General-Purpose computation on GPU (GPGPU) 2016: the 2 fastest supercomputers use GPUs/accelerators #1 Tianhe-2 (China) #2 Titan (USA) 18,648 (1 AMD Opteron + 1 NVIDIA Tesla K20X) 16,000 (2 Intel Xeon + 3 Intel Xeon Phi) 3

Processing Unit (GPU) Many embedded SoCs Intel Sandy Bridge AMD Fusion

4 GPGPU in the future? Yesterday ( ) Homogeneous multi-core Discrete components Today ( ) Chip-level integration Central Processing Unit (CPU) Graphics Processing Unit (GPU) Many embedded SoCs Intel Sandy Bridge AMD Fusion NVIDIA Denver/Maxwell project Tomorrow Heterogeneous multi-core GPUs to blend into throughput-optimized cores? Latencyoptimized cores Throughputoptimized cores Hardware accelerators Heterogeneous multi-core chip 4

5 Today: GPU internals What makes a GPU tick? NVIDIA GeForce GTX 980 Maxwell GPU. Artist rendering! 5

6 Outline GPU, many-core: why, what for? Technological trends and constraints From graphics to general purpose Forms of parallelism, how to exploit them Why we need (so much) parallelism: latency and throughput Sources of parallelism: ILP, TLP, DLP Uses of parallelism: horizontal, vertical Let's design a GPU! Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD Putting it all together Architecture of current GPUs: cores, memory High-level performance modeling 6

7 The free lunch era... was yesterday 1980's to 2002: Moore's law, Dennard scaling, micro-architecture improvements Exponential performance increase Software compatibility preserved Hennessy, Patterson. Computer Architecture, a quantitative approach. 4 th Ed Do not rewrite software, buy a new machine! 7

8 Technology evolution Memory wall Compute Performance Gap Memory Memory speed does not increase as fast as computing speed Harder to hide memory latency Time Power wall Power consumption of transistors does not decrease as fast as density increases Transistor density Performance is now limited by power consumption Transistor power ILP wall Law of diminishing returns on Instruction-Level Parallelism Total power Time Cost Pollack rule: cost performance² Serial performance 8

computing devices are power-constrained Laptops, cell phones, tablets

9 Usage changes New applications demand parallel processing Computer games : 3D graphics Search engines, social networks big data processing New computing devices are power-constrained Laptops, cell phones, tablets Small, light, battery-powered Datacenters High power supply and cooling costs 9

10 Latency vs. throughput Latency: time to solution Minimize time, at the expense of power Metric: time e.g. seconds Throughput: quantity of tasks processed per unit of time Assumes unlimited parallelism Minimize energy per operation Metric: operations / time e.g. Gflops / s CPU: optimized for latency GPU: optimized for throughput 10

11 Amdahl's law Bounds speedup attainable on a parallel machine 1 S= Time to run sequential portions 1 P P N Time to run parallel portions S P N Speedup Ratio of parallel portions Number of processors S (speedup) N (available processors) G. Amdahl. Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. AFIPS

12 Why heterogeneous architectures? Time to run sequential portions S= 1 P 1 P N Time to run parallel portions Latency-optimized multi-core (CPU) Low efficiency on parallel portions: spends too much resources Throughput-optimized multi-core (GPU) Low performance on sequential portions Heterogeneous multi-core (CPU+GPU) Use the right tool for the right job Allows aggressive optimization for latency or for throughput M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer,

13 Example: System on Chip for smartphone Small cores for background activity GPU Big cores for applications Lots of interfaces Special-purpose accelerators 13

14 Outline GPU, many-core: why, what for? Technological trends and constraints From graphics to general purpose Forms of parallelism, how to exploit them Why we need (so much) parallelism: latency and throughput Sources of parallelism: ILP, TLP, DLP Uses of parallelism: horizontal, vertical Let's design a GPU! Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD Putting it all together Architecture of current GPUs: cores, memory High-level performance modeling 14

15 The (simplest) graphics rendering pipeline Primitives (triangles ) Vertices Fragment shader Vertex shader Textures Z-Compare Blending Clipping, Rasterization Attribute interpolation Pixels Fragments Framebuffer Z-Buffer Programmable stage Parametrizable stage 15

16 How much performance do we need to run 3DMark 11 at 50 frames/second? Element Per frame Per second Vertices 12.0M 600M Primitives 12.6M 630M Fragments 180M 9.0G Instructions 14.4G 720G Intel Core i7 2700K: 56 Ginsn/s peak We need to go 13x faster Make a special-purpose accelerator Source: Damien Triolet, Hardware.fr 16

17 Beginnings of GPGPU Microsoft DirectX 7.x a 9.0b 9.0c Unified shaders NVIDIA NV10 FP 16 NV20 NV30 Programmable shaders Dynamic control flow G70 R200 G80-G90 CTM R300 R400 GT200 GF100 CUDA SIMT FP 24 ATI/AMD R100 FP 32 NV40 R500 FP 64 R600 CAL R700 Evergreen GPGPU traction

18 Today: what do we need GPUs for? 1. 3D graphics rendering for games Complex texture mapping, lighting computations 2. Computer Aided Design workstations Complex geometry 3. GPGPU Complex synchronization, data movements One chip to rule them all Find the common denominator 18

19 Outline GPU, many-core: why, what for? Technological trends and constraints From graphics to general purpose Forms of parallelism, how to exploit them Why we need (so much) parallelism: latency and throughput Sources of parallelism: ILP, TLP, DLP Uses of parallelism: horizontal, vertical Let's design a GPU! Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD Putting it all together Architecture of current GPUs: cores, memory High-level performance modeling 19

20 What is parallelism? Parallelism: independent operations whose execution can be overlapped Operations: memory accesses or computations How much parallelism do I need? Little's law in queuing theory Average customer arrival rate λ Average time spent W Average number of customers: L = λ W Parallelism = throughput latency Units: Memory: B = GB/s ns Arithmetic: flops = Gflops/s ns J. Little. A proof for the queuing formula L= λ W. JSTOR

21 Throughput and latency: CPU vs. GPU Throughput (GB/s) CPU memory: Core i7 4790, DDR3-1600, 2 channels 25.6 Latency (ns) 67 GPU memory: NVIDIA GeForce GTX 980, GDDR5-7010, 256-bit 224 x8 Area: 56 x6 Need 56 times more parallelism! 410 ns 21

22 Sources of parallelism ILP: Instruction-Level Parallelism Between independent instructions in sequential program TLP: Thread-Level Parallelism Between independent execution contexts: threads DLP: Data-Level Parallelism Between elements of a vector: same operation on several elements add r3 r1, r2 Parallel mul r0 r0, r1 sub r1 r3, r0 Thread 1 add Thread 2 mul vadd r a,b Parallel a1 + b1 a2 + b2 a3 + b3 r1 r2 r3 24

23 Example: X a X In-place scalar-vector product: X a X Sequential (ILP) For i = 0 to n-1 do: X[i] a * X[i] Threads (TLP) Launch n threads: X[tid] a * X[tid] Vector (DLP) X a * X Or any combination of the above 25

24 Uses of parallelism Horizontal parallelism for throughput A B C D More units working in parallel Vertical parallelism for latency hiding A Pipelining: keep units busy when waiting for dependencies, memory B C D A B C A B latency throughput A cycle 1 cycle 2 cycle 3 cycle 4 26

25 How to extract parallelism? Horizontal Vertical ILP Superscalar Pipelined TLP Multi-core SMT Interleaved / switch-on-event multithreading DLP SIMD / SIMT Vector / temporal SIMT We have seen the first row: ILP We will now review techniques for the next rows: TLP, DLP 27

26 Outline GPU, many-core: why, what for? Technological trends and constraints From graphics to general purpose Forms of parallelism, how to exploit them Why we need (so much) parallelism: latency and throughput Sources of parallelism: ILP, TLP, DLP Uses of parallelism: horizontal, vertical Let's design a GPU! Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD Putting it all together Architecture of current GPUs: cores, memory High-level performance modeling 28

27 Sequential processor for i = 0 to n-1 X[i] a * X[i] Source code move i 0 loop: load t X[i] mul t a t store X[i] t add i i+1 branch i<n? loop Machine code add i 18 Fetch store X[17] Decode mul Execute Memory Memory Sequential CPU Focuses on instruction-level parallelism Exploits ILP: vertically (pipelining) and horizontally (superscalar) 29

Area: benefits from Moore's law Power: extra cores consume

28 The incremental approach: multi-core Several processors on a single chip sharing one memory space Intel Sandy Bridge Area: benefits from Moore's law Power: extra cores consume little when not in use e.g. Intel Turbo Boost Source: Intel 30

29 Homogeneous multi-core Horizontal use of thread-level parallelism add i 50 F IF store X[49] D ID mul mul EX X LSU Mem Threads: T0 EX X Memory add i 18 F IF store X[17] D ID LSU Mem T1 Improves peak throughput 31

30 Example: Tilera Tile-GX Grid of (up to) 72 tiles Each tile: 3-way VLIW processor, 5 pipeline stages, 1.2 GHz Tile (1,1) Tile (1,2) Tile (1,8) Tile (9,1) Tile (9,8) 32

31 Interleaved multi-threading Vertical use of thread-level parallelism Threads: T0 T1 T2 T3 mul Fetch mul Decode add i 73 add i 50 Execute load X[89] store X[72] load X[17] store X[49] Memory load-store unit Memory Hides latency thanks to explicit parallelism improves achieved throughput 33

32 Example: Oracle Sparc T5 16 cores / chip Core: out-of-order superscalar, 8 threads 15 pipeline stages, 3.6 GHz Thread 1 Thread 2 Thread 8 Core 1 Core 2 Core 16 34

33 Clustered multi-core For each individual unit, select between T0 T1 T2 Cluster 1 T3 Cluster 2 Horizontal replication Vertical time-multiplexing br Examples mul Sun UltraSparc T2, T3 AMD Bulldozer IBM Power 7 add i 73 Fetch store Decode mul add i 50 load X[89] store X[72] load X[17] store X[49] EX Memory L/S Unit Area-efficient tradeoff Blurs boundaries between cores 35

34 Implicit SIMD Factorization of fetch/decode, load-store units Fetch 1 instruction on behalf of several threads Read 1 memory location and broadcast to several registers (0-3) load F T1 (0-3) store D T2 T3 (0) mul (0) (1) mul (2) mul (1) (2) (3) mul X (3) Memory T0 Mem In NVIDIA-speak SIMT: Single Instruction, Multiple Threads Convoy of synchronized threads: warp Extracts DLP from multi-thread applications 36

35 Explicit SIMD Single Instruction Multiple Data Horizontal use of data level parallelism add i 20 F vstore X[ D vmul X Memory loop: vload T X[i] vmul T a T vstore X[i] T add i i+4 branch i<n? loop Mem Machine code SIMD CPU Examples Intel MIC (16-wide) AMD GCN GPU (16-wide 4-deep) Most general purpose CPUs (4-wide to 8-wide) 37

36 Quizz: link the words Parallelism Architectures ILP Superscalar processor TLP Homogeneous multi-core DLP Multi-threaded core Use Horizontal: more throughput Clustered multi-core Implicit SIMD Explicit SIMD Vertical: hide latency 38

37 Quizz: link the words Parallelism Architectures ILP Superscalar processor TLP Homogeneous multi-core DLP Multi-threaded core Use Horizontal: more throughput Clustered multi-core Implicit SIMD Explicit SIMD Vertical: hide latency 39

38 Quizz: link the words Parallelism Architectures ILP Superscalar processor TLP Homogeneous multi-core DLP Multi-threaded core Use Horizontal: more throughput Clustered multi-core Implicit SIMD Explicit SIMD Vertical: hide latency 40

39 Outline GPU, many-core: why, what for? Technological trends and constraints From graphics to general purpose Forms of parallelism, how to exploit them Why we need (so much) parallelism: latency and throughput Sources of parallelism: ILP, TLP, DLP Uses of parallelism: horizontal, vertical Let's design a GPU! Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD Putting it all together Architecture of current GPUs: cores, memory High-level performance modeling 41

40 Example GPU: NVIDIA GeForce GTX 580 Combines all techniques SIMT: warps of 32 threads 16 SMs / chip 2 16 cores / SM, 48 warps / SM Warp 1 Warp 2 Warp 3 Warp 4 Core 32 Core 18 Time Core 17 Warp 47 Core 16 Core 2 Core 1 Warp 48 SM1 SM16 Up to 512 operations per cycle from threads in flight 44

41 Taxonomy of parallel architectures Horizontal Vertical ILP Superscalar / VLIW Pipelined TLP Multi-core SMT Interleaved / switch-onevent multithreading DLP SIMD / SIMT Vector / temporal SIMT 45

42 Takeaway Parallelism for throughput and latency hiding Types of parallelism: ILP, TLP, DLP All modern processors exploit the 3 kinds of parallelism GPUs focus on Thread-level and Data-level parallelism 49

43 Outline Computer architecture crash course The simplest processor Exploiting instruction-level parallelism GPU, many-core: why, what for? Technological trends and constraints From graphics to general purpose Forms of parallelism, how to exploit them Why we need (so much) parallelism: latency and throughput Sources of parallelism: ILP, TLP, DLP Uses of parallelism: horizontal, vertical Let's design a GPU! Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD Putting it all together Architecture of current GPUs: cores, memory 50

44 External memory: discrete GPU Classical CPU-GPU model Split memory spaces Need to transfer data explicitely Highest bandwidth from GPU memory PCI Express CPU 26GB/s Transfers to main memory are slower Main memory 8 GB 16GB/s GPU 224GB/s Graphics memory 4 GB Ex: Intel Core i7 4790, Nvidia GeForce GTX

45 External memory: embedded GPU Most GPUs today are integrated Same memory May support memory coherence GPU can read directly from CPU caches More contention on external memory CPU GPU Cache 26GB/s Main memory 8 GB 52

46 CPU: caches for latency Large memories are slower than small memories Computer theorists are liars: in the real world, access in an array of size n costs O(log n), not O(1)! Think about looking up a book in a small or huge library Idea: put frequently-accessed data in small, fast memory Can be applied recursively: hierarchy with multiple levels of cache CPU Registers L1 cache L2 cache Capacity: 4 KB Access time: 0 (fully-bypassed) Capacity: 64 KB Access time: 2 ns 1 MB 10 ns L3 cache 8 MB 30 ns Memory 8 GB 60 ns 55

47 GPU: caches for throughput Cache hierarchy Keep frequently-accessed data SM SM Reduce throughput demand on main memory L1 SMem SM ~2 TB/s L1 SMem L1 Smem 1 MB Managed by hardware (L1, L2) or software (shared memory) GPU Crossbar L2 L2 L2 6 MB 290 GB/s GPU memory 56

48 Outline GPU, many-core: why, what for? Technological trends and constraints From graphics to general purpose Forms of parallelism, how to exploit them Why we need (so much) parallelism: latency and throughput Sources of parallelism: ILP, TLP, DLP Uses of parallelism: horizontal, vertical Let's design a GPU! Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD Putting it all together Architecture of current GPUs: cores, memory High-level performance modeling 58

49 First-order performance model Questions you should ask yourself, before starting to code or optimize Will my code run faster on the GPU? Is my existing code running as fast as it should? Is performance limited by computations or memory bandwidth? Pen-and-pencil calculations can (often) answer such questions 59

50 Performance: metrics and definitions Optimistic evaluation: upper bound on performance Assume perfect overlap of computations and memory accesses Memory accesses: bytes Only external memory, not caches or registers Computations: flops Only useful computations (usually floating-point) not address calculations, loop iterators.. Arithmetic intensity: flops / bytes = computations / memory accesses Property of the code Arithmetic throughput: flops / s Property of code + architecture 60

51 The roofline model How much performance can I get for a given arithmetic intensity? Upper bound on arithmetic throughput, as a function of arithmetic intensity Property of the architecture Arithmetic throughput (Gflops/s) Arithmetic intensity (flops/byte) Memory-bound Compute-bound S. Williams, A. Waterman, D. Patterson. Roofline: an insightful visual performance model 64 for multicore architectures. Communications of the ACM, 2009

52 Building the machine model Compute or measure: Peak memory throughput GTX 980: 224 GB/s Ideal arithmetic intensity = peak compute throughput / mem throughput GTX 980: 6412 (Gflops/s) / 224 (GB/s) = 28.6 flops/b 4 (B/flop) = 114 (dimensionless) Arithmetic throughput (Gflops/s) 4612 Gflops 28.6 B/flop Beware of units: float=4b, double=8b! Arithmetic intensity (flops/byte) Achievable peaks may be lower than theoretical peaks Lower curves when adding realistic constraints 65

53 Using the model Compute arithmetic intensity, measure performance of program Identify bottleneck: memory or computation Take optimization decision Measured performance Optimize computation Optimize memory accesses Arithmetic throughput (Gflops/s) Reuse data Arithmetic intensity (flops/byte) 66

54 Example: dot product for i = 1 to n r += a[i] * b[i] How many computations? How many memory accesses? Arithmetic intensity? Compute-bound or memory-bound? How many Gflop/s on a GTX 980 GPU? With data in GPU memory? With data in CPU memory? How many Gflop/s on an i CPU? GTX 980: 4612 Gflops/s, 224 GB/s i7 4790: 460 Gflops/s, 25.6 GB/s PCIe link: 16 GB/s 67

55 Example: dot product for i = 1 to n r += a[i] * b[i] How many computations? 2 n flops How many memory accesses? 2 n words Arithmetic intensity? 1 flop/word = 0.25 flop/b Compute-bound or memory-bound? Highly memory-bound How many Gflop/s on a GTX 980 GPU? With data in GPU memory? 224 GB/s 0.25 flop/b 56 Gflops/s With data in CPU memory? 16 GB/s 0.25 flop/b 4 Gflops/s How many Gflop/s on an i CPU? 25.6 GB/s 0.25 flop/b 6.4 Gflops/s Conclusion: don't bother porting to GPU! GTX 980: 4612 Gflops/s, 224 GB/s i7 4790: 460 Gflops/s, 25.6 GB/s PCIe link: 16 GB/s 68

56 Takeaway Result of many tradeoffs Between locality and parallelism Between core complexity and interconnect complexity GPU optimized for throughput Exploits primarily DLP, TLP Energy-efficient on parallel applications with regular behavior CPU optimized for latency Exploits primarily ILP Can use TLP and DLP when available Performance models Back-of-the-envelope calculations and common sense can save time Next time: GPU programming in CUDA 69

57 Computer architecture crash course How does a processor work? Or was working in the 1980s to 1990s: modern processors are much more complicated! An attempt to sum up 30 years of research in 15 minutes 70

58 Machine language: instruction set Registers Example State for computations Keeps variables and temporaries Instructions R0, R1, R2, R3... R31 Perform computations on registers, move data between register and memory, branch Instruction word Binary representation of an instruction Assembly language Readable form of machine language ADD R1, R3 Examples Which instruction set does your laptop/desktop run? Your cell phone? 71

59 The Von Neumann processor PC Fetch unit +1 Instruction word Decoder Operands Operation Memory Register file Branch Unit Arithmetic and Logic Unit Load/Store Unit Result bus State machine Let's look at it step by step 72

60 Step by step: Fetch The processor maintains a Program Counter (PC) Fetch: read the instruction word pointed by PC in memory PC Fetch unit Instruction word Memory 73

61 Decode Split the instruction word to understand what it represents Which operation? ADD Which operands? R1, R3 PC Fetch unit Instruction word Decoder Operands R1, R3 Operation Memory ADD 74

62 Read operands Get the value of registers R1, R3 from the register file PC Fetch unit Instruction word Decoder Operands R1, R3 Register file Memory 42, 17 75

63 Execute operation Compute the result: PC Fetch unit Instruction word Decoder Operands Operation Memory Register file 42, 17 ADD Arithmetic and Logic Unit 59 76

64 Write back Write the result back to the register file PC Fetch unit Instruction word Decoder R1 Operands 59 Operation Memory Register file Arithmetic and Logic Unit Result bus 77

65 Increment PC PC Fetch unit +1 Instruction word Decoder Operands Operation Memory Register file Arithmetic and Logic Unit Result bus 78

66 Load or store instruction Can read and write memory from a computed address PC Fetch unit +1 Instruction word Decoder Operands Operation Memory Register file Arithmetic and Logic Unit Load/Store Unit Result bus 79

67 Branch instruction Instead of incrementing PC, set it to a computed value PC Fetch unit +1 Instruction word Decoder Operands Operation Memory Register file Branch Unit Arithmetic and Logic Unit Load/Store Unit Result bus 80

68 What about the state machine? The state machine controls everybody Sequences the successive steps Send signals to units depending on current state At every clock tick, switch to next state Clock is a periodic signal (as fast as possible) Fetchdecode Read operands Increment PC Read operands Writeback Execute 81

69 Pipelined processor Program 1: add, r1, r3 2: mul r2, r3 3: load r3, [r1] Fetch Decode Execute Writeback 1: add Independent instructions can follow each other Exploits ILP to hide instruction latency 82

70 Pipelined processor Program 1: add, r1, r3 2: mul r2, r3 3: load r3, [r1] Fetch Decode 2: mul 1: add Execute Writeback Independent instructions can follow each other Exploits ILP to hide instruction latency 83

71 Pipelined processor Program 1: add, r1, r3 2: mul r2, r3 3: load r3, [r1] Fetch Decode 3: load 2: mul Execute Writeback 1: add Independent instructions can follow each other Exploits ILP to hide instruction latency 84

72 Superscalar execution Multiple execution units in parallel Independent instructions can execute at the same time Fetch Decode Execute Writeback Decode Execute Writeback Decode Execute Writeback Exploits ILP to increase throughput 85

73 Locality Time to access main memory: ~200 clock cycles One memory access every few instructions Are we doomed? Fortunately: principle of locality ~90% of memory accesses on ~10% of data Accessed locations are often the same Temporal locality Access the same location at different times Spacial locality Access locations close to each other 86

74 Caches Large memories are slower than small memories The computer theorists lied to you: in the real world, access in an array of size n costs O(log n), not O(1)! Think about looking up a book in a small or huge library Idea: put frequently-accessed data in small, fast memory Can be applied recursively: hierarchy with multiple levels of cache L1 cache L2 cache Capacity: 64 KB Access time: 2 ns 1 MB 10 ns L3 cache 8 MB 30 ns Memory 8 GB 60 ns 87

75 Flynn's taxonomy Classification of parallel architectures Single Instruction F Single Data X SISD Multiple Data F Multiple Instruction FFFF X MISD FFFF XXXX XXXX SIMD MIMD M. Flynn. Some Computer Organizations and Their Effectiveness. IEEE TC

Parallel programming: Introduction to GPU architecture

Parallel programming: Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr http://www.irisa.fr/alf/collange/ PPAR - 2018 Outline of the course March