Around GPGPU: architecture, programming, and arithmetic

Size: px

Start display at page:

Download "Around GPGPU: architecture, programming, and arithmetic"

Iris Barnett
5 years ago
Views:

1 Around GPGPU: architecture, programming, and arithmetic Sylvain Collange, Arénaire, LIP, ENS Lyon RAIM'11, Perpignan February 9, 2011

2 Key challenges for parallel architectures In computer architecture how to? Scalability Overcome memory wall and power wall? Programmability Write portable, reusable parallel software with minimal effort? Reliability Tolerate hardware faults? Validate large-scale parallel applications? In computer arithmetic how to? Adjust precision to minimize power and memory transfers? Make error analysis easy, debug floating-point programs? Validate, certify numerical algorithms and results? 2

Why should we care about GPUs? Today Yesterday (2000-2010) Homogeneous multi-core Discrete components Superscalar CPU GPU ASIC FPGA Tomorrow Today (2011-.

3 Why should we care about GPUs? Today Yesterday ( ) Homogeneous multi-core Discrete components Superscalar CPU GPU ASIC FPGA Tomorrow Today ( ) Heterogeneous multi-core Intel Sandy Bridge shipping AMD Fusion sampling NVIDIA Denver/Maxwell project announced Today's GPU = tomorrow's throughput processor Latencyoptimized cores Throughputoptimized cores (Configurable?) hardware accelerators Heterogeneous multi-core chip 3

4 Outline How a GPU works Software: fine-grained vs. coarse-grained threading Hardware: from multi-core to SIMT GPU programming guidelines Arithmetic features 4

5 Threading granularity e.g. In-place vector scaling X a X X Coarse-grained threading Decouple tasks to reduce conflicts and inter-thread communication T0 T1 T2 T3 X[0..3] X[4..7] X[8..11] X[12-15] Fine-grained threading Interleave tasks Exhibit locality: neighbor threads share memory Exhibit regularity: neighbor threads have a similar behavior X T0 T1 T2 T3 X[0] X[4] X[8] X[12] X[1] X[2] X[5] X[6] X[9] X[10] X[13] X[14] X[3] X[7] X[11] X[15] 5

6 Regularity Similarity in behavior between threads Instruction regularity Regular Irregular Thread mul mul mul mul mul add store load add add add add load mul sub add Time Control regularity i=17 i=17 i=17 i=17 switch(i) { case 2:... case 17:... case 21:... } i=21 i=4 i=17 i=2 Memory regularity load X[8] load X[9] load X[10] load X[11] load X[8] load X[0] load X[11] load X[3] X Memory 6

7 Outline How a GPU works Software: fine-grained vs. coarse-grained threading Hardware: from multi-core to SIMT GPU programming guidelines Arithmetic features 7

8 Homogeneous multi-core Replication of the complete execution engine Threads: coarse-grained T0 T2 X Machine code Architecture move i slice_begin loop: load t X[i] mul t a t store X[i] t add i i+1 branch i<slice_end? loop add i 18 IF store X[17] mul IF ID EX LSU add i 50 IF store X[49] mul IF ID EX LSU Memory Improves throughput thanks to explicit parallelism 8

9 Interleaved multi-threading Time-multiplexing of processing units Threads: fine-grained X T0 T1 T2 T3 Architecture mul mul Fetch Decode Machine code move i tid loop: load t X[i] mul t a t store X[i] t add i i+tcnt branch i<n? loop add i 21 add i 23 load X[16] store X[17] load X[18] store X[19] Execute L/S Unit Memory Improves latency tolerance thanks to explicit parallelism 9

10 Single-instruction, multiple threads (SIMT) Cooperative sharing of fetch/decode, load-store units Fetch 1 instruction on behalf of several threads Read 1 memory location and broadcast to several registers X Threads: fine-grained T0 T1 T2 T3 (0) mul (0-3) load IF (0-3) store ID (1) mul (2) mul (3) mul EX Memory (0) (1) (2) (3) LSU Improves Area/Power-efficiency thanks to regularity Consolidates memory transactions: less memory pressure 10

11 Around GPGPU fine-grained multi-thread architectures exploiting regularity: architecture, programming, and arithmetic Sylvain Collange, Arénaire, LIP, ENS Lyon RAIM'11, Perpignan February 9, 2011

12 Core 32 Core 18 Core 17 Core 16 Core 2 Core 1 Example GPU: NVIDIA GeForce GTX 580 SIMT: warps of 32 synchronized threads 16 Streaming Multiprocessors / chip 2 16 cores / SM, 48 warps / SM Warp 1 Warp 3 Warp 2 Warp 4 Warp 47 Warp 48 Time 1580 Gflop/s Up to threads in flight SM1 SM16 12

13 SIMD vs. SIMT Instruction regularity Control regularity Memory regularity SIMD Vectorization at compile-time Software-managed Bit-masking, predication Compiler selects: vector load-store or gather-scatter SIMT Vectorization at runtime Hardware-managed Stack, counters, multiple PCs Hardware-managed Gather-scatter with hardware coalescing SIMT architecture : run regular fine-grained threads on SIMD units SIMT = one possible implementation of SPMD 13

14 GPU design space: not just SIMT How can we execute SPMD threads? MIMD (multi-core) Horizontal SIMT NVIDIA Tesla NVIDIA Fermi Multi-threading (Hyperthreading-like) Vertical SIMT Pipelined vectors (Cray-like) This is the GPU architect's problem Programmer's point of view: just threads Microarchitecture-specific optimizations Or just focus on locality and regularity AMD Evergreen AMD Northern Islands 14

15 Outline How a GPU works GPU programming guidelines Consequences of fine-grained threading Bottlenecks and limitations The more threads, the better? Arithmetic features 15

16 Consequence: SoA data layout Array of Structures (AoS) Good for random access to the whole structure (RGB pixel) struct Pixel { float r, g, b; }; Pixel image_aos[480][640]; Structure of Arrays (SoA) Good for sequential access Efficient access to a subset of fields (only green) With fine-grained threading: brings regularity struct Image { float R[480][640]; float G[480][640]; float B[480][640]; }; Image image_soa; 16

17 Case study: call stack Coarse-grained (x86 ) Split stacks Fine-grained (Fermi) Interleaved stacks T0 T1 SP T0 T1 T2 T3 SP Cache line T2 T3 Memory Memory No false sharing Good for coherent mem Regularity cache way conflicts [MengSka09] False sharing Good for spacial locality! 17

18 Outline How a GPU works GPU programming guidelines Consequences of fine-grained threading Bottlenecks and limitations The more threads, the better? Arithmetic features 18

19 Our #1 constraint is power Power measurements on NVIDIA GT200 [CDT10] Energy/op (nj) Total power (W) Instruction control Multiply-add on element vector Load 128B from DRAM With the same amount of energy Load 1 word from DRAM Compute 44 flops Need to keep memory traffic low Standard solution: caches, on-chip memory 19

20 On-chip memory on GPUs Conventional wisdom Cache area in CPU vs. GPU according to the NVIDIA CUDA Programming Guide Actual data GPU NVIDIA GF110 AMD Cayman Register files + caches 3.9 MB 7.7 MB At this rate, will catch up with CPUs by

21 Little's law: data=throughput latency Throughput (GB/s) Intel Core i , Latency (ns) NVIDIA GeForce GTX 580 L1 L2 320 DRAM ns21

22 Outline How a GPU works GPU programming guidelines Consequences of fine-grained threading Bottlenecks and limitations The more threads, the better? Arithmetic features 22

23 As many threads as possible? Example: SGEMM from CUBLAS 1.1 From: Vasily Volkov. Programming inverse memory hierarchy : case of stencils on GPUs. ParCFD, Addresses, indices (affine, linear increase) Duplicated data (uniform) Useful data 15 registers / thread Temporary data 512 threads / CTA 15 registers/thread Only 2 registers/thread really needed: 87% wastage 23

24 Overhead of SIMT Run as many threads as possible? Maximal parallelism better latency tolerance Less locality: need to keep private data of each thread More thread management overhead Initialization, redundant operations Instruction-Level Parallelism is still alive Out-of-order completion of load/stores on Tesla, Fermi Superscalar (supervector?) execution on GF104 VLIW on AMD architectures Thread count is not increasing, flops/thread is GPU (year) Threads MFlops/thread GT200 (2008) GF110 (2010)

25 Fewer threads, more computations Volkov SGEMM 8 elements computed / thread Unrolled loops Less traffic through shared memory, more through registers Overhead amortized 1920 registers vs for the same amount of work Works for redundant computations too Success story +60% compared to CUBLAS 1.1 Adopted in CUBLAS 2.0 More in [Volk10] Adresses, indices Useful data Duplicated data Temporary data 64 threads / CTA 30 registers / thread 25

26 Takeaway Distribute work and data Favor SoA Favor locality and regularity Use common sense (avoid extraneous copies or indirections) More threads higher performance Saturate instruction-level parallelism first (almost free) Then use data parallelism (expensive in terms of locality) Compiler optimization: unroll & jam And/or hardware-based solutions flexible register allocation data deduplication [CDZ09] 26

27 Outline How a GPU works GPU programming guidelines Arithmetic features IEEE-754? A bit of history FP capabilities 27

28 Every new generation is now IEEE-754 The Vector Unit floating-point precision is approximately IEEE. E. Lindholm, M. Kilgard, H. Moreton, A user-programmable vertex engine, SIGGRAPH, 2001 The vector unit can perform four IEEE single-precision multiply, add, or multiply-add operations, as well as inner products, max, min, and so on. J. Montrym, H. Moreton, The GeForce 6800, IEEE Micro, 2005 The floating-point add and multiply operations are compatible with the IEEE 754 standard for single-precision FP numbers, including not-a-number (NaN) and infinity values. Erik Lindholm et al., NVIDIA Tesla: a unified graphics and computing architecture, IEEE Micro, 2008 Single precision floating point instructions now support subnormal numbers by default in hardware, as well as all four IEEE rounding modes (nearest, zero, positive infinity, and negative infinity). NVIDIA's next generation CUDA compute architecture: Fermi Whitepaper, 2009 All compute devices follow the IEEE standard for binary floating-point arithmetic with the following deviations: [ 2-page long bullet list ] NVIDIA CUDA C Programming Guide,

29 Some partial, recent GPU history Microsoft DirectX 7.x a 9.0b 9.0c NVIDIA NV10 NV20 NV30 NV40 G70 G80-G90 GT200 GF100 FP 16 Programmable FP 32 shaders Dynamic control flow SIMT? CUDA ATI/AMD FP 24 CTM FP 64 CAL R100 R200 R300 R400 R500 R600 R700 Evergreen NI GPGPU traction

30 Arithmetic features 2006 (ATI R500, NVIDIA G70): Cray-1-like FP Truncated multipliers, adders with 2 guard bits and no sticky 41 / 41 1 Same GPU, different units: different behavior 2007 (ATI R600, NVIDIA G80) Correct IEEE-754 rounding to the nearest for +, Integer arithmetic and logical ops 2008 (AMD R670, NVIDIA GT200) Binary (AMD Evergreen, NVIDIA GF100) 4 mandatory IEEE rounding modes FMA for both Binary32 and Binary64 Subnormals at full-speed 30

31 Hardware elementary functions We therefore conclude that (1) the entire function library should be included in the hardware if and only if COMPLEX data types and their corresponding arithmetic are formally introduced; (2) the following error/accuracy criterion should be adopted and met by the implementation: [Correct rounding]. If either of these conditions is not met, then none of the elementary functions should be included in the hardware. G. Paul, M.W. Wilson, Should the elementary function library be incorporated into computer instruction sets?, TOMS, years later: still no complex datatypes nor correct rounding of elementary functions But we have hardware elementary functions on GPUs 1/x, 1/ x, log 2, 2 x, sin, cos Accuracy: 22 to 23 bits Applications: graphics, physics, finance 31

32 Graphics is bandwidth-starved too Lower-precision format: Binary16 11-bit significand, 5-bit exponent In IEEE-754:2008 Back to the 50's: Block Floating-Point formats One shared exponent, multiple significands More compact storage for correlated FP data 1, , , , m 1 m 2 m 3 m 4 e f 1 =m 1 x2 e f 2 =m 2 x2 e f 3 =m 3 x2 e f 4 =m 4 x2 e Lossy compression of textures in memory Hardware-based on-the-fly decompression Lossless compression of frame buffer, depth buffer 32

33 Fused multiply-add (FMA) (a b+c) with only one rounding error Higher accuracy a b Error-free transformations Different behavior than a b+c Loss of commutativity (+ in dot product ) + c a b a b+c R(a b+c) On all recent GPU Reasons GPUs do lots of linear algebra Easier to implement 4-operand instructions than on CPUs Used for IEEE-compliant division 33

34 Full-speed subnormals On CPUs, no hardware support for subnormals Micro-coded exception handler (x86 ) Software exception handler (Alpha ) On GPUs, subnormals in hardware at full-speed NV GT200 in binary64 NV Fermi AMD Evergreen Reasons Producing output denormals is free with an FMA (Cell SPU) GPU pipelines are latency-tolerant (GF100: 16-cycle FMA) Exception handling breaks regularity inefficient in SIMT 34

35 Static rounding attributes On CPUs Rounding mode as a mode for each thread Get/set with e.g. fegetround() and fesetround() On NVIDIA GPUs Rounding direction: flag in the instruction word Benefit: zero-overhead mode switch Reason: no legacy numeric code on GPUs Applications Interval arithmetic Interval CUDA SDK sample 100 speedup for the same development effort Stochastic arithmetic [JL10] 35

36 AMD Cypress floating-point unit 36

37 AMD Cypress floating-point unit Example Operation 4 FMA Rx Ax Bx + Cx Ry Ay By + Cy 4 UMA Rx (Ax Bx) + Cx Ry (Ay By) + Cy UDOT4 Rx (Ax Bx) + (Ay By) + (Az Bz) + (Aw Bw) 2 UDOT2 Rx (Ax Bx) + (Ay By) Rz (Az Bz) + (Aw Bw) 2 UMFMA Rx (Bx Az) Bz + Cx Ry (By Aw) Bw + Cy FMA Binary64 Rxy Axy Bxy + Cxy 37

38 Next arithmetic features? Arithmetic exceptions, traps, flags Easier to implement on a GPU than on a CPU Bullet feature for IEEE-754-compliance Market demand? Speed? Fused operations (FDOT2, FADD2 ) Fused dot product a b+c d: natural evolution of AMD's FPU Not in IEEE-754: not a drop-in replacement for unfused ops Parallel long accumulators 1 accum per thread: cannot fit on-chip with ~25000 threads Shared long accumulator In software using atomic instructions? Hardware assist for high-radix carry-save, conflict resolution? 38

39 Conclusion Software side Fine-grained thread parallelism Regularity Specialized in FP arithmetic From 8-bit fixed point to IEEE-754:2008 in 10 years Now better FP support than on most CPUs Specialized in graphics Exotic arithmetic units Can HPC learn from computer graphics? Fixed-function units, memory compression? Driven mostly by economic forces 39

40 FP OXO Format FMA UMA Rounding Subnormals Inf, NaN Flags Exceptions Intel X86 80 bit 4 Dynamic Microcode IBM PowerPC 64 bit 4 Dyn. Microcode Intel IA bit 4 Dyn. + Stat. Microcode 32 bit RZ IBM Cell SPU 64 bit 4 Dyn. Output NVIDIA 32 bit 2 Static GT bit 4 Static 32 bit RN AMD RV bit RN 32 bit 4 Static NVIDIA GF100 AMD Evergreen 64 bit 4 Static 32 bit 4 Dyn. 64 bit 4 Dyn. OpenCL 1.1 Direct3D bit Opt N/A RN Opt 64 bit 4 StaticRN 32 bit RN 64 bit RN 40

41 References [MengSka09] Jiayuan Meng, Kevin Skadron. Avoiding cache thrashing due to private data placement in last-level cache for manycore scaling. ICCD [Glew09] Andy Glew. Coherent vector lane threading. Berkeley ParLab Seminar, [CDT09] Sylvain Collange, David Defour, Arnaud Tisserand. Power consumption of GPUs from a software perspective. ICCS [Volk10] Vasily Volkov. Better Performance at Lower Occupancy. GTC [CDZ09] Sylvain Collange, David Defour, Yao Zhang. Dynamic detection of uniform and affine vectors in GPGPU computations. Europar HPPC09, 2009 [JL10] Fabienne Jezequel, Jean-Luc Lamotte. Numerical validation of Slater integrals computation on GPU. SCAN

Around GPGPU: architecture, programming, and arithmetic

Around GPGPU: architecture, programming, and arithmetic Sylvain Collange, Arénaire, LIP, ENS Lyon David Defour, DALI, ELIAUS, Université de Perpignan November 10, 2010 Key challenges for parallel architectures