Improving the efficiency of parallel architectures with regularity


1 Improving the efficiency of parallel architectures with regularity Sylvain Collange Arénaire, LIP, ENS de Lyon Dipartimento di Ingegneria dell'informazione Università di Siena February 2, 2011

2 Where I come from. Arénaire, LIP, École normale supérieure de Lyon: 11 faculty members, 17 PhD students/postdocs/staff; computer arithmetic at all levels (validated algorithms, function approximation in software, hardware operators on FPGAs). DALI, LIRMM, Université de Perpignan: 7 faculty members, 6 PhD students/postdocs; computer arithmetic (compensation, certification) and computer architecture (simulation, micro-architecture, GPUs). 2

3 Context. Accelerate compute-intensive applications: HPC (computational fluid dynamics, seismic imaging, DNA folding, phylogenetics) and multimedia (3D rendering, video, image processing). Current constraints: power consumption, and the cost of moving and retaining data. Focus on the GPU: the video-game mass market brings economies of scale and amortized R&D, yielding a high-performance parallel processor useful for much more than graphics, now in the #1 supercomputer (Tianhe-1A). 3

4 Why should we care about GPUs? Yesterday (2000-2010): homogeneous multi-core, plus discrete components (superscalar CPU, GPU, ASIC, FPGA). Today: the heterogeneous multi-core chip (Intel Sandy Bridge shipping, AMD Fusion sampling, NVIDIA Denver/Maxwell project announced), combining latency-optimized cores, throughput-optimized cores, and (configurable?) hardware accelerators. 4

6 Outline Step-by-step introduction to GPU architecture Software model Exploiting explicit parallelism Exploiting implicit regularity Exploiting data regularity Limitations of current GPUs Dynamic data deduplication Static data deduplication Ongoing work Conclusion 6

7 Software view. Programming model: threads, SPMD (Single Program, Multiple Data): one kernel code, many threads, with unspecified execution order between explicit synchronization barriers. Languages: graphics shaders (HLSL, Cg, GLSL) and GPGPU (C for CUDA, OpenCL). Example kernel: in-place scalar-vector product X ← a·X; for n threads: X[tid] ← a * X[tid]. Let's build a computer to run these programs. 7
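The SPMD contract above can be made concrete with a minimal sketch (plain Python, for illustration only): one kernel body, instantiated once per thread ID, with the in-place scalar-vector product from the slide. A sequential loop is one valid implementation of the unspecified-order semantics.

```python
def kernel(tid, X, a):
    # Each thread scales one element; the model says nothing about
    # the order in which different tids run between barriers.
    X[tid] = a * X[tid]

def launch(n, X, a):
    # A GPU runs the threads in parallel; a plain loop implements
    # the same semantics, since the threads are independent here.
    for tid in range(n):
        kernel(tid, X, a)

X = [1.0, 2.0, 3.0, 4.0]
launch(len(X), X, 2.0)
print(X)  # [2.0, 4.0, 6.0, 8.0]
```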

8 1st step: sequential processor. Machine code: move i 0 / loop: load t X[i] / mul t a t / store X[i] t / add i i+1 / branch i<n? loop. Pipeline: Fetch, Decode, Execute, Load/Store unit, Memory. Can exploit implicit instruction-level parallelism to increase throughput. Obstacle: not scalable. David Patterson (UC Berkeley): Power Wall + Memory Wall + ILP Wall = Brick Wall. 8

9 2nd step: homogeneous multi-core. Replication of the complete execution engine; software uses coarse-grained threading. Machine code: move i slice_begin / loop: load t X[i] / mul t a t / store X[i] t / add i i+1 / branch i<slice_end? loop. Each thread (T0, T1) runs on its own fetch/decode/execute/load-store pipeline. Improves throughput thanks to explicit parallelism. 9

10 3rd step: interleaved multi-threading. Time-multiplexing of processing units; software uses coarse-grained or fine-grained threading. Machine code (fine-grained threading): move i tid / loop: load t X[i] / mul t a t / store X[i] t / add i i+tcnt / branch i<n? loop. Threads T0-T3 share a single fetch/decode/execute/load-store pipeline in alternation. Hides latency thanks to explicit parallelism. 10
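The fine-grained work partition in this machine code can be sketched directly: thread tid of tcnt handles elements tid, tid+tcnt, tid+2·tcnt, ... (a sketch in plain Python; the hardware interleaves the threads rather than running them one after another).

```python
def thread_body(tid, tcnt, X, a, n):
    # The per-thread loop from the slide: start at tid, step by tcnt.
    i = tid
    while i < n:
        X[i] = a * X[i]
        i += tcnt

def run(tcnt, X, a):
    n = len(X)
    for tid in range(tcnt):  # hardware would time-multiplex these
        thread_body(tid, tcnt, X, a, n)

X = list(range(8))
run(4, X, 10)
print(X)  # [0, 10, 20, 30, 40, 50, 60, 70]
```

Every element is visited exactly once, whatever tcnt is, which is why this partition needs no coordination between threads.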

11 4th step: clustered multi-core. For each individual unit, select between replication and time-multiplexing. Examples: Sun UltraSPARC T2, AMD Bulldozer. Threads T0-T3 are grouped into clusters that share some units (fetch, decode, load-store) and replicate others (execute). An area-efficient tradeoff. 11

12 Regularity: similarity in behavior between threads. Instruction regularity: regular when, at each time step, all threads execute the same instruction (mul, then add, in lockstep); irregular when they execute different ones (mul/load, add/mul, store/sub). Control regularity: regular when all threads take the same path through switch(i), e.g. i=17 in every thread; irregular when i differs across threads (i=21, i=4, i=17, i=2). Memory regularity: regular when threads access consecutive addresses (load X[8], X[9], X[10], X[11]); irregular when accesses are scattered (load X[8], X[0], X[11], X[3]). 12

13 Final step: SIMT. Cooperative sharing of the fetch/decode and load-store units: fetch 1 instruction on behalf of several threads; read 1 memory location and broadcast it to several registers. Threads T0-T3 execute the load, mul, and store together, each on its own lane. In NVIDIA-speak this is SIMT (Single Instruction, Multiple Threads), and a convoy of synchronized threads is a warp. Improves area/power efficiency thanks to regularity, and consolidates memory transactions: less memory pressure. 13
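The transaction consolidation mentioned above can be sketched as follows: per-lane addresses falling in the same aligned segment are served by a single memory transaction. This is a simplified model; the 128-byte segment size is an assumption (it matches common GPU memory-transaction granularities, not a figure from this deck).

```python
SEGMENT = 128  # assumed transaction granularity, in bytes

def coalesce(addresses):
    # One transaction per distinct aligned segment touched by the warp.
    return sorted({addr // SEGMENT for addr in addresses})

# Regular case: 32 lanes load consecutive 4-byte words -> 1 transaction.
regular = [4 * lane for lane in range(32)]
print(len(coalesce(regular)))    # 1

# Irregular case: a large stride -> one transaction per lane.
irregular = [256 * lane for lane in range(32)]
print(len(coalesce(irregular)))  # 32
```

The same 32 loads cost either 1 or 32 transactions depending only on memory regularity, which is the whole argument for keeping accesses consecutive.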

14 So how does it differ from SIMD? Instruction regularity: SIMD vectorizes at compile-time; SIMT vectorizes at runtime. Control regularity: SIMD is software-managed (bit-masking, predication); SIMT is hardware-managed (stack, counters, multiple PCs). Memory regularity: in SIMD the compiler selects vector load-store or gather-scatter; in SIMT gather-scatter is hardware-managed, with coalescing. SIMD is static, SIMT is dynamic: a similar opposition as VLIW vs. superscalar. 14

15 Example GPU: NVIDIA GeForce GTX 580. SIMT with warps of 32 threads; 16 SMs per chip; 2 × 16 cores and 48 warps per SM, with each SM time-multiplexing its warps over its cores. Up to 16 × 48 × 32 = 24,576 threads in flight, for roughly 1.5 Tflop/s of peak single-precision throughput. 15

16 Outline Step-by-step introduction to GPU architecture Software model Exploiting explicit parallelism Exploiting implicit regularity Exploiting data regularity Limitations of current GPUs Dynamic data deduplication Static data deduplication Ongoing work Conclusion 16

17 Knowing our baseline. Design and run micro-benchmarks targeting the NVIDIA Tesla architecture: go far beyond published specifications, understand design decisions. Run power studies [CDT09]: energy measurements on micro-benchmarks, to understand power constraints. 17

18 Barra: a functional GPU simulator [CDDP10]. Modeled after NVIDIA Tesla GPUs: executes native CUDA binaries and reproduces SIMT execution. Built within the Unisim framework. Fast: uses SSE instructions and multi-threading. Accurate: bit-accurate floating-point. Produces low-level statistics and allows experimenting with architecture changes. 18

19 Primary constraint: power. Power measurements on NVIDIA GT200 [CDT09]: energy per operation (nJ) for instruction control, a multiply-add on a 32-element vector, and a 128 B load from DRAM, together with total power (W). Bottom line: with the same amount of energy, you can load 1 word from DRAM or compute 44 flops. We need to keep memory traffic low; the standard solution is caches. 19

20 On-chip memory. Conventional wisdom: the NVIDIA CUDA Programming Guide depicts GPUs as devoting far less area to caches than CPUs. Actual data: register files + caches total 3.9 MB on the NVIDIA GF110 and 7.7 MB on the AMD Cayman. At this rate, GPUs will soon catch up with CPUs. 20

21 The cost of SIMT: register wastage. SIMD code: mov i 0 / loop: vload T X[i] / vmul T a T / vstore X[i] T / add i i+16 / branch i<n? loop. SIMT code (per thread): mov i tid / loop: load t X[i] / mul t a t / store X[i] t / add i i+tcnt / branch i<n? loop. In SIMD, only the load, mul and store are vector instructions, and a single scalar copy of a, i and n lives in scalar registers next to the vector register T. In SIMT, every instruction and every register (t, a, i, n) is duplicated across all threads, even when the values are identical or follow a simple pattern. 21

22 SIMD vs. SIMT. Instruction regularity: vectorization at compile-time (SIMD) vs. at runtime (SIMT). Control regularity: software-managed bit-masking and predication (SIMD) vs. hardware-managed stack, counters, multiple PCs (SIMT). Memory regularity: compiler-selected vector load-store or gather-scatter (SIMD) vs. hardware-managed gather-scatter with coalescing (SIMT). Data regularity: scalar registers and scalar instructions (SIMD) vs. duplicated registers and duplicated operations (SIMT). 22

23 Uniform and affine vectors, at warp granularity. Uniform vector: the value does not depend on the lane ID; within a warp, v[i] = c (e.g. c = 5 in every lane). Affine vector: within a warp, v[i] = b + i·s for a base b and stride s, an affine relation between value and lane ID (e.g. b = 8, s = 1 gives 8, 9, 10, ...; s = 2 gives 8, 10, 12, ...). Generic vector: anything else. 23
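These definitions translate directly into a classifier; a sketch in plain Python, following exactly the slide's two tests (v[i] = c and v[i] = b + i·s):

```python
def classify(v):
    # Candidate base and stride from the first two lanes.
    b = v[0]
    s = v[1] - v[0] if len(v) > 1 else 0
    if all(x == b for x in v):
        return "uniform"
    if all(x == b + i * s for i, x in enumerate(v)):
        return "affine"   # base b, stride s
    return "generic"

print(classify([5, 5, 5, 5]))      # uniform
print(classify([8, 10, 12, 14]))   # affine (b=8, s=2)
print(classify([8, 11, 9, 14]))    # generic
```

A uniform vector is the s = 0 special case of an affine one, which is why two lanes suffice to represent both (a point the deck uses later for tagging).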

24 How many uniform / affine vectors? Analysis by simulation with Barra on applications from the CUDA SDK, at a granularity of 16 threads, classifying register-file inputs and outputs as uniform, affine, or generic. Result: 49% of all reads from the register file are affine, and 32% of all instructions compute on affine vectors. This is what we have in the wild; how do we capture it? 24

25 Outline Step-by-step introduction to GPU architecture Software model Exploiting explicit parallelism Exploiting implicit regularity Exploiting data regularity Limitations of current GPUs Dynamic data deduplication Static data deduplication Ongoing work Conclusion 25

26 Tagging registers. Associate a tag with each vector register: Uniform (U), Affine (A), or unknown (K); 2 lanes are enough to encode uniform and affine vectors. Propagate tags across arithmetic instructions. Example trace for the kernel loop (mov i tid / loop: load t X[i] / mul t a t / store X[i] t / add i i+tcnt / branch i<n? loop): i starts affine (A) since tid is affine; the load address is U[A] and its result t is unknown (K); mul t a t stays K; add i i+tcnt is A + U = A; the branch compares A < U. 26
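The propagation rules can be sketched as a small lattice over the tags U, A, K. This is a simplified illustration, not the exact hardware rule set (the real design also tracks strides and special cases):

```python
def add_tag(x, y):
    # U+U stays uniform; A+U and A+A stay affine (strides add);
    # anything touching K is unknown.
    if "K" in (x, y):
        return "K"
    if "A" in (x, y):
        return "A"
    return "U"

def mul_tag(x, y):
    # A*U stays affine (the stride is scaled by a uniform value),
    # but A*A is not affine in general (it has an i^2 term).
    if "K" in (x, y):
        return "K"
    if x == "A" and y == "A":
        return "K"
    if "A" in (x, y):
        return "A"
    return "U"

print(add_tag("A", "U"))  # A   (i + tcnt in the kernel loop)
print(mul_tag("U", "K"))  # K   (t = a*t, once t came from memory)
```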

27 Dynamic data deduplication: results. Detects 79% of affine input operands and 75% of affine computations (detected uniform and affine inputs and outputs, compared against the totals; the remainder is tagged unknown). 27

28 New pipeline. A new deduplication stage, in parallel with the predication control stage, tracks a tag alongside each register ID. Split the register-file banks into a scalar part and a vector part; apply fine-grained clock-gating on the vector RF and the SIMD datapaths; share execution units cooperatively. Pipeline: Fetch → Deduplication (reg ID + tag) / Decode → Branch / Mask → Read operands (scalar RF + vector RF) → Execute ... 28

29 Power savings. The vector register file is inactive for 38% of operands, and the SIMD datapaths are inactive for 24% of instructions, so both can be clock-gated. S. Collange, D. Defour, Y. Zhang. Dynamic detection of uniform and affine vectors in GPGPU computations. Euro-Par HPPC'09, 2009. 29

30 SIMD vs. SIMT. Instruction regularity: vectorization at compile-time (SIMD) vs. at runtime (SIMT). Control regularity: software-managed bit-masking and predication (SIMD) vs. hardware-managed stack, counters, multiple PCs (SIMT). Memory regularity: compiler-selected vector load-store or gather-scatter (SIMD) vs. hardware-managed gather-scatter with coalescing (SIMT). Data regularity: scalar registers and scalar instructions (SIMD) vs. data deduplication at runtime (SIMT). 30

31 Outline Step-by-step introduction to GPU architecture Software model Exploiting explicit parallelism Exploiting implicit regularity Exploiting data regularity Limitations of current GPUs Dynamic data deduplication Static data deduplication Ongoing work Conclusion 31

32 A scalarizing compiler? Programming models and architectures: sequential programming maps to a scalar CPU through a traditional compiler, and to a scalar+SIMD CPU through a vectorizing compiler; the SPMD model (CUDA, OpenCL) maps to SIMT GPUs through an SPMD compiler. The missing arrow: a scalarizing compiler that maps SPMD programs onto actual scalar+SIMD CPUs. 32

33 Still uniform and affine vectors. Uniform vectors and affine vectors with known stride enable: scalar registers and scalar operations (uniform inputs give uniform outputs, as in c = a + b); uniform branches (a uniform condition c means if(c) goes the same way in all threads); and vector load/store instead of gather/scatter for affine, aligned addresses (x = t[i] with affine i reads the consecutive locations t[8]..t[15]). 33

34 From SPMD to SIMD. Forward dataflow analysis: statically propagate tags in the dataflow graph, with tags C(v) (constant of value v), U (uniform), A(s) (affine with known stride s), and K (unknown); also propagate alignment. SPMD code in SSA form: mov i0 tid / phi i1 φ(i0, i2) / load t X[i1] / mul t a t / store X[i1] t / add i2 i1+tcnt / branch i2<n? loop. The analysis gives i0: A(1); i1: φ(A(1), ...) pending the back edge; t: K with address U[A(1)]; i2: A(1) + C(tcnt) = A(1); branch condition A(1) < C(n). 34

35 From SPMD to SIMD. At the fixed point, i1 = φ(A(1), A(1)) = A(1), so the SPMD kernel is rewritten as SIMD code: mov i0 0 / phi i1 φ(i0, i2) / vload t X[i1] / vmul t a t / vstore X[i1] t / add i2 i1+tcnt / branch i2<n? loop. The affine index becomes a scalar loop counter, and the per-thread loads and stores become unit-strided vector accesses. 35
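The static side of the analysis can be sketched with a tag domain that carries strides, mirroring the i2 = i1 + tcnt step above. A sketch under simplified assumptions (tags as tuples, add only; the real analysis also handles mul, φ-nodes, and alignment):

```python
def add(x, y):
    # Tags: ("C", v) constant, ("U",) uniform, ("A", s) affine
    # with stride s, ("K",) unknown.
    if x[0] == "K" or y[0] == "K":
        return ("K",)
    if x[0] == "A" and y[0] == "A":
        return ("A", x[1] + y[1])   # strides add
    if x[0] == "A" or y[0] == "A":
        a = x if x[0] == "A" else y
        return ("A", a[1])          # adding uniform/constant keeps the stride
    if x[0] == "C" and y[0] == "C":
        return ("C", x[1] + y[1])
    return ("U",)

tid  = ("A", 1)   # lane ID: affine, stride 1
tcnt = ("U",)     # thread count: a uniform kernel parameter
i = tid
for _ in range(3):        # loop body: i = i + tcnt, iterated to a fixed point
    i = add(i, tcnt)
print(i)  # ('A', 1)
```

The stride-1 tag survives the loop, which is exactly what licenses turning load t X[i] into a unit-strided vload.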

36 Results: instructions, registers. Benchmarks: CUDA SDK, SGEMM, Rodinia, Parboil. Static operands: a large fraction of input operands are identified as uniform or affine rather than unknown. Split register allocation: a substantial share of registers can be allocated as scalar rather than vector. 36

37 Results: memory, control. Loads are identified as uniform or unit-strided vs. other; branches as uniform vs. divergent. The compiler thus identifies at compile-time the situations that SIMT detects at runtime: uniform branches; uniform and unit-strided loads & stores; scalar instructions and registers. 37

38 Static vs. dynamic data deduplication. Static deduplication: allows simpler hardware; governs register allocation and instruction scheduling; enables constant propagation; applies to memory (call stack ...). Dynamic deduplication: preserves binary compatibility; captures dynamic behavior; unaffected by (future) software complexity. 38

39 Outline Step-by-step introduction to GPU architecture Software model Exploiting explicit parallelism Exploiting implicit regularity Exploiting data regularity Limitations of current GPUs Dynamic data deduplication Static data deduplication Ongoing work Conclusion 39

40 My register file is full of holes! Static partitioning of the RF between warps is space-inefficient: across the memory and compute phases of warps 0-2, many registers are dead, and many live ones hold affine or uniform values. We need more dynamic, flexible register allocation (AMD's is more flexible, but still static), and it must take affine registers into account. Solution: play Tetris? Or use caches. 40

41 Ongoing work: affine caches. As a level-1 cache, an extension to the RF: 1 cache line = 1 spilled vector register, giving space-efficient storage of call stacks. A unified data array is indexed through separate tags for generic and for affine lines, and one line can store 1 generic vector or 16 affine vectors; lookup, fill, and write-back work as in a conventional cache. Inspired by zero-content augmented caches [DPS09]. 41
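The space saving comes from representing an affine line by its (base, stride) pair instead of all lane values. A sketch of that compression (plain Python; the 16-lane width and the "2 words instead of 16" ratio are illustrative assumptions matching the slide's line sizes):

```python
WARP = 16  # lanes per stored vector, as on the slide

def compress(v):
    # An affine vector collapses to (base, stride): 2 words, not 16.
    b, s = v[0], v[1] - v[0]
    if v == [b + i * s for i in range(len(v))]:
        return ("affine", b, s)
    return ("generic", list(v))

def expand(entry):
    # Reconstruct the full vector from either representation.
    if entry[0] == "affine":
        _, b, s = entry
        return [b + i * s for i in range(WARP)]
    return entry[1]

v = [100 + 4 * i for i in range(WARP)]  # e.g. consecutive addresses
e = compress(v)
print(e)  # ('affine', 100, 4)
assert expand(e) == v
```

Since most spilled values in the measurements above are affine, one physical line's worth of storage can hold many such compressed registers, which is the intuition behind "1 generic vector or 16 affine vectors".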

42 Affine cache as L0: RF replacement. An affine ALU, fed from an affine array with its own tags, handles most control flow and address computations; the vector ALUs/FPUs, each with a per-lane L0 array, do the heavy lifting. With SIMT execution, only 1 tag lookup is needed per operand, since the translation from architectural to micro-architectural register IDs is the same for all lanes. Open question: coordinate warp scheduling with the cache replacement policy? 42

43 Bottom line Introduced regularity: instruction, control, memory, data Current GPGPU applications exhibit significant data regularity Both static and dynamic techniques can exploit it Enables power savings Less data storage and movement 43

44 Future challenges. SIMT bridges the gap between superscalar and SIMD: a smooth, dynamic tradeoff between regularity and efficiency (flops/W). On the efficiency-vs-regularity plane, superscalar serves the most irregular workloads (transaction processing), SIMD the most regular (dense linear algebra), and SIMT sits in between (computer graphics). Future microarchitectural improvements: dynamic deduplication, decoupled scalar-SIMT, affine caches? Related directions: dynamic warp formation [FSYA07], dynamic warp subdivision [MTS10], NIMT [Glew09]. 44

45 References
[Glew09] Andy Glew. Coherent vector lane threading. Berkeley ParLab Seminar, 2009.
[CDT09] Sylvain Collange, David Defour, Arnaud Tisserand. Power consumption of GPUs from a software perspective. ICCS 2009.
[CDDP10] Sylvain Collange, Marc Daumas, David Defour, David Parello. Barra: a parallel functional simulator for GPGPU. MASCOTS 2010.
[CDZ09] Sylvain Collange, David Defour, Yao Zhang. Dynamic detection of uniform and affine vectors in GPGPU computations. Euro-Par HPPC'09, 2009.
[FSYA07] Wilson W. L. Fung, Ivan Sham, George Yuan, Tor M. Aamodt. Dynamic warp formation and scheduling for efficient GPU control flow. MICRO 2007.
[MTS10] Jiayuan Meng, David Tarjan, Kevin Skadron. Dynamic warp subdivision for integrated branch and memory divergence tolerance. ISCA 2010.
[DPS09] Julien Dusser, Thomas Piquet, André Seznec. Zero-content augmented caches. ICS 2009.


More information

Simultaneous Branch and Warp Interweaving for Sustained GPU Performance

Simultaneous Branch and Warp Interweaving for Sustained GPU Performance Simultaneous Branch and Warp Interweaving for Sustained GPU Performance Nicolas Brunie Sylvain Collange Gregory Diamos by Ravi Godavarthi Outline Introduc)on ISCA'39 (Interna'onal Society for Computers

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 2018/19 A.J.Proença Data Parallelism 3 (GPU/CUDA, Neural Nets,...) (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2018/19 1 The

More information

Lecture 7: The Programmable GPU Core. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Lecture 7: The Programmable GPU Core. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011) Lecture 7: The Programmable GPU Core Kayvon Fatahalian CMU 15-869: Graphics and Imaging Architectures (Fall 2011) Today A brief history of GPU programmability Throughput processing core 101 A detailed

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Spring Prof. Hyesoon Kim

Spring Prof. Hyesoon Kim Spring 2011 Prof. Hyesoon Kim 2 Warp is the basic unit of execution A group of threads (e.g. 32 threads for the Tesla GPU architecture) Warp Execution Inst 1 Inst 2 Inst 3 Sources ready T T T T One warp

More information

Lecture 27: Multiprocessors. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs

Lecture 27: Multiprocessors. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Lecture 27: Multiprocessors Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs 1 Shared-Memory Vs. Message-Passing Shared-memory: Well-understood programming model

More information

Dynamic detection of uniform and affine vectors in GPGPU computations

Dynamic detection of uniform and affine vectors in GPGPU computations Dynamic detection of uniform and affine vectors in GPGPU computations Sylvain Collange, David Defour, Yao Zhang To cite this version: Sylvain Collange, David Defour, Yao Zhang. Dynamic detection of uniform

More information

Computer Architecture: SIMD and GPUs (Part II) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: SIMD and GPUs (Part II) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: SIMD and GPUs (Part II) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 19: SIMD

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Lecture 2: CUDA Programming

Lecture 2: CUDA Programming CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:

More information

Processors. Young W. Lim. May 12, 2016

Processors. Young W. Lim. May 12, 2016 Processors Young W. Lim May 12, 2016 Copyright (c) 2016 Young W. Lim. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version

More information

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture

More information

Stack-less SIMT reconvergence at low cost

Stack-less SIMT reconvergence at low cost Stack-less SIMT reconvergence at low cost Sylvain Collange To cite this version: Sylvain Collange. Stack-less SIMT reconvergence at low cost. 2011. HAL Id: hal-00622654 https://hal.archives-ouvertes.fr/hal-00622654

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

ECE 8823: GPU Architectures. Objectives

ECE 8823: GPU Architectures. Objectives ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming. Lecturer: Alan Christopher

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming. Lecturer: Alan Christopher CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming Lecturer: Alan Christopher Overview GP-GPU: What and why OpenCL, CUDA, and programming GPUs GPU Performance

More information

Exotic Methods in Parallel Computing [GPU Computing]

Exotic Methods in Parallel Computing [GPU Computing] Exotic Methods in Parallel Computing [GPU Computing] Frank Feinbube Exotic Methods in Parallel Computing Dr. Peter Tröger Exotic Methods in Parallel Computing FF 2012 Architectural Shift 2 Exotic Methods

More information

COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture

COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture Mary Thomas Department of Computer Science Computational Science Research Center (CSRC) San Diego State University (SDSU) Posted:

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

Introduction to GPU programming with CUDA

Introduction to GPU programming with CUDA Introduction to GPU programming with CUDA Dr. Juan C Zuniga University of Saskatchewan, WestGrid UBC Summer School, Vancouver. June 12th, 2018 Outline 1 Overview of GPU computing a. what is a GPU? b. GPU

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

Soft GPGPUs for Embedded FPGAS: An Architectural Evaluation

Soft GPGPUs for Embedded FPGAS: An Architectural Evaluation Soft GPGPUs for Embedded FPGAS: An Architectural Evaluation 2nd International Workshop on Overlay Architectures for FPGAs (OLAF) 2016 Kevin Andryc, Tedy Thomas and Russell Tessier University of Massachusetts

More information

Continuum Computer Architecture

Continuum Computer Architecture Plenary Presentation to the Workshop on Frontiers of Extreme Computing: Continuum Computer Architecture Thomas Sterling California Institute of Technology and Louisiana State University October 25, 2005

More information

DATA-LEVEL PARALLELISM IN VECTOR, SIMD ANDGPU ARCHITECTURES(PART 2)

DATA-LEVEL PARALLELISM IN VECTOR, SIMD ANDGPU ARCHITECTURES(PART 2) 1 DATA-LEVEL PARALLELISM IN VECTOR, SIMD ANDGPU ARCHITECTURES(PART 2) Chapter 4 Appendix A (Computer Organization and Design Book) OUTLINE SIMD Instruction Set Extensions for Multimedia (4.3) Graphical

More information

Vector Processors. Kavitha Chandrasekar Sreesudhan Ramkumar

Vector Processors. Kavitha Chandrasekar Sreesudhan Ramkumar Vector Processors Kavitha Chandrasekar Sreesudhan Ramkumar Agenda Why Vector processors Basic Vector Architecture Vector Execution time Vector load - store units and Vector memory systems Vector length

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

Improving Performance of Machine Learning Workloads

Improving Performance of Machine Learning Workloads Improving Performance of Machine Learning Workloads Dong Li Parallel Architecture, System, and Algorithm Lab Electrical Engineering and Computer Science School of Engineering University of California,

More information

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for

More information

CS 179 Lecture 4. GPU Compute Architecture

CS 179 Lecture 4. GPU Compute Architecture CS 179 Lecture 4 GPU Compute Architecture 1 This is my first lecture ever Tell me if I m not speaking loud enough, going too fast/slow, etc. Also feel free to give me lecture feedback over email or at

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2017 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2017 Jens Teubner Computer Architecture Summer 2017 34 Part II Graphics

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering

More information

Programmer's View of Execution Teminology Summary

Programmer's View of Execution Teminology Summary CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 28: GP-GPU Programming GPUs Hardware specialized for graphics calculations Originally developed to facilitate the use of CAD programs

More information

Graphics Processor Acceleration and YOU

Graphics Processor Acceleration and YOU Graphics Processor Acceleration and YOU James Phillips Research/gpu/ Goals of Lecture After this talk the audience will: Understand how GPUs differ from CPUs Understand the limits of GPU acceleration Have

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

Why GPUs? Robert Strzodka (MPII), Dominik Göddeke G. TUDo), Dominik Behr (AMD)

Why GPUs? Robert Strzodka (MPII), Dominik Göddeke G. TUDo), Dominik Behr (AMD) Why GPUs? Robert Strzodka (MPII), Dominik Göddeke G (TUDo( TUDo), Dominik Behr (AMD) Conference on Parallel Processing and Applied Mathematics Wroclaw, Poland, September 13-16, 16, 2009 www.gpgpu.org/ppam2009

More information

Exploring GPU Architecture for N2P Image Processing Algorithms

Exploring GPU Architecture for N2P Image Processing Algorithms Exploring GPU Architecture for N2P Image Processing Algorithms Xuyuan Jin(0729183) x.jin@student.tue.nl 1. Introduction It is a trend that computer manufacturers provide multithreaded hardware that strongly

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors Weifeng Liu and Brian Vinter Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk March 1, 2014 Weifeng

More information

! Readings! ! Room-level, on-chip! vs.!

! Readings! ! Room-level, on-chip! vs.! 1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III)

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III) EE382 (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III) Mattan Erez The University of Texas at Austin EE382: Principles of Computer Architecture, Fall 2011 -- Lecture

More information

Introduction to CUDA (1 of n*)

Introduction to CUDA (1 of n*) Agenda Introduction to CUDA (1 of n*) GPU architecture review CUDA First of two or three dedicated classes Joseph Kider University of Pennsylvania CIS 565 - Spring 2011 * Where n is 2 or 3 Acknowledgements

More information

Mattan Erez. The University of Texas at Austin

Mattan Erez. The University of Texas at Austin EE382V (17325): Principles in Computer Architecture Parallelism and Locality Fall 2007 Lecture 12 GPU Architecture (NVIDIA G80) Mattan Erez The University of Texas at Austin Outline 3D graphics recap and

More information

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers COMP 322: Fundamentals of Parallel Programming Lecture 37: General-Purpose GPU (GPGPU) Computing Max Grossman, Vivek Sarkar Department of Computer Science, Rice University max.grossman@rice.edu, vsarkar@rice.edu

More information