Intra-Warp Compaction Techniques
Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)

Goal
- Compact threads in a warp to coalesce (and eliminate) idle cycles → improve utilization
[Figure: active and idle threads in a warp, before and after compaction]

References
- V. Narasiman et al., Improving GPU Performance via Large Warps and Two-Level Scheduling, MICRO 2011
- A. S. Vaidya et al., SIMD Divergence Optimization Through Intra-Warp Compaction, ISCA 2013

Improving GPU Performance via Large Warps and Two-Level Scheduling
V. Narasiman et al., MICRO 2011

Goals
- Improve performance of divergent code via compaction of threads within a warp
- Integrate warp scheduling optimization with intra-warp compaction

Resource Underutilization
[Figure: resource underutilization for 32 warps, 32 threads per warp, single SM]
- Due to control divergence
- Due to memory divergence

Warp Scheduling and Locality
[Figure: two timelines of loads issued by Warp0, Warp1, ..., WarpN, shown next to a GPU block diagram (host CPU, HW work queues, kernel management unit, kernel distributor, SMX scheduler, SMXs with warp schedulers, warp context registers, L1 cache / shared memory, L2 cache, memory controller, DRAM)]
- Loads issued together in time
  - Opportunities for memory coalescing
  - Potential for exposing memory stalls
- Loads staggered in time
  - Degrades memory reference locality and row buffer locality
  - Overlaps memory accesses

Key Ideas
- Conventional design today → warp size = #SIMD lanes
  - Conventional GPU: round-robin (RR) fetch policy
- Use large warps → multicycle issue of sub-warps
  - Compact threads in a warp to form fully utilized sub-warps
- 2-level scheduler to spread memory accesses in time
  - Reduce memory-related stall cycles

Large Warps
- Typical operation: warp size = SIMD width (warp size = 4, SIMD width = 4)
- Proposed approach: large warp with multi-cycle issue (warp size = 16, SIMD width = 4)
- Large warps converted to a sequence of sub-warps
[Figure: register file (RF) feeding the SIMD pipelines, for the typical and the proposed organization]

Sub-Warp Compaction
- Iteratively select one thread per column to create a packed sub-warp
- Dynamic generation of sub-warps
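The column-wise selection on the Sub-Warp Compaction slide can be sketched in a few lines of Python. This is a minimal illustration, not the hardware design: it assumes the large warp's active mask is stored as rows of SIMD-width lanes, that a thread always stays in its own column (lane), and that the first active row in a column is picked each cycle. The name `compact_sub_warps` is hypothetical.

```python
def compact_sub_warps(active_mask):
    """Pack a large warp's active threads into sub-warps.

    active_mask: list of rows, one bool per SIMD lane (column).
    Returns one list per generated sub-warp; entry c holds the row index
    of the thread chosen for lane c, or None if that lane stays idle.
    """
    remaining = [row[:] for row in active_mask]  # do not mutate the input
    width = len(remaining[0]) if remaining else 0
    sub_warps = []
    while any(any(row) for row in remaining):
        sub_warp = []
        for col in range(width):
            picked = None
            # Assumed policy: take the first still-active thread in this column.
            for row_idx, row in enumerate(remaining):
                if row[col]:
                    picked = row_idx
                    row[col] = False
                    break
            sub_warp.append(picked)
        sub_warps.append(sub_warp)
    return sub_warps
```

Under these assumptions the number of sub-warps equals the largest number of active threads in any single column. A 4x4 mask whose columns hold 2, 1, 1, and 3 active threads therefore issues in three sub-warps rather than four rows.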

Impact on the Register File
[Figure: baseline register file; large warp active mask organization; large warp register file]
- Need separate decoders per bank

Scheduling Constraints
- Next large warp cannot be scheduled until first sub-warp completes execution
- Scoreboard checks for issue dependencies
  - Thread not available for packing into a sub-warp unless previous issue (sub-warp) has completed → single-bit status
  - Simple check
  - However, on a branch, all sub-warps must complete before the warp is eligible for instruction fetch scheduling
- Re-fetch policy for conditional branches
  - Must wait until last sub-warp finishes
- Optimization for unconditional branch instructions
  - Don't create multiple sub-warps
  - Sub-warping always completes in a single cycle

Effect of Control Divergence
- Note that divergence is unknown until all sub-warps execute
  - Divergence management just happens on large warp boundaries
  - Need to buffer sub-warp state, e.g., active masks
- The last warp effect
  - Cannot fetch the next instruction in a warp until all sub-warps issue
  - Trailing warp (warp divergence effect) can lead to many idle cycles
- Effect of the last thread
  - E.g., in data-dependent loop iteration count across threads
  - Last thread can hold up reconvergence

A Round Robin Warp Scheduler
[Figure: timeline of loads issued by Warp0, Warp1, ..., WarpN, shown next to the GPU block diagram (host CPU, kernel management unit, kernel distributor, SMX scheduler, SMXs with warp schedulers, L1 cache / shared memory, L2 cache, memory controller, DRAM)]
- Exploit inter-warp reference locality in the cache
- Exploit inter-warp reference locality in the DRAM row buffers
- However, need to maintain latency hiding

Two Level Round Robin Scheduler
[Figure: large warps LW0 to LW15 arranged in four fetch groups (Fetch Group 0: LW0 to LW3, Fetch Group 1: LW4 to LW7, Fetch Group 2: LW8 to LW11, Fetch Group 3: LW12 to LW15), with round-robin (RR) selection]
- Fetch group size: enough to keep the pipeline busy

Scheduler Behavior
- Need to set fetch group size carefully → tune to fill the pipeline
- Timeout on switching fetch groups to mitigate the last warp effect
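A two-level round-robin fetch policy of this kind can be sketched as follows. This is an illustrative model only: the class name `TwoLevelRoundRobin`, the `ready` set interface, and the rule "stay in the current fetch group until none of its warps is ready" are assumptions made for the sketch, and the fetch-group timeout mentioned on the slide is deliberately left out.

```python
class TwoLevelRoundRobin:
    """Round-robin within a fetch group, round-robin across fetch groups."""

    def __init__(self, num_warps, group_size):
        # Consecutive warp ids form a fetch group, e.g. [0..3], [4..7], ...
        self.groups = [
            list(range(start, min(start + group_size, num_warps)))
            for start in range(0, num_warps, group_size)
        ]
        self.active = 0                            # current fetch group
        self.next_slot = [0] * len(self.groups)    # RR pointer per group

    def pick(self, ready):
        """Return the next warp id to fetch for, or None if nothing is ready.

        ready: set of warp ids that are not stalled this cycle.
        """
        for step in range(len(self.groups)):
            g = (self.active + step) % len(self.groups)
            group = self.groups[g]
            for i in range(len(group)):
                slot = (self.next_slot[g] + i) % len(group)
                if group[slot] in ready:
                    self.next_slot[g] = (slot + 1) % len(group)
                    self.active = g   # stay here until this group stalls
                    return group[slot]
        return None
```

With eight warps and a fetch group size of four, the sketch cycles through warps 0 to 3 while any of them is ready and moves to warps 4 to 7 only once all of the first group is stalled. Different fetch groups therefore tend to reach their long-latency loads at different times, which is the intended spreading of memory accesses.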

Summary
- Intra-warp compaction made feasible due to multicycle warp execution
  - Mismatch between warp size and SIMD width enables flexible intra-warp compaction
- Do not make warps too big → last thread effect begins to dominate

SIMD Divergence Optimization Through Intra-Warp Compaction
A. S. Vaidya et al., ISCA 2013

Goals
- Improve utilization in divergent code via intra-warp compaction
- Become familiar with Intel's Gen integrated general-purpose GPU architecture

Integrated GPUs: Intel HD Graphics
[Figure from The Computer Architecture of the Intel Processor Graphics Gen9]

Integrated GPUs: Intel HD Graphics
- Shared physical memory
- Gen graphics processor
- 32-byte bidirectional ring
  - Dedicated coherence signals
- Coherent distributed cache
  - Shared with GPU
  - Operates as a memory-side cache
[Figure from The Computer Architecture of the Intel Processor Graphics Gen9]

Inside the Gen9 EU
- Architecture Register File (ARF)
- Per-thread register state
  - Up to 7 threads
  - 128 256-bit registers per thread (8-way SIMD)
- Each thread executes a kernel
  - Threads may execute different kernels
  - Multi-instruction dispatch
[Figure from The Computer Architecture of the Intel Processor Graphics Gen9]

Operation (2)
- SIMD-16, SIMD-8, and SIMD-32 instructions are transformed into SIMD-4 instructions
- Divergence/reconvergence management
- Support both FP and integer operations
  - 4 32-bit FP operations
  - 8 16-bit integer operations
  - 8 16-bit FP operations
  - MAD operations each cycle
- 96 bytes/cycle read BW
- 32 bytes/cycle write BW
- Dispatch 4 instructions from 4 threads
  - Constraints between issue slots
[Figure from The Computer Architecture of the Intel Processor Graphics Gen9]

Operation
- SIMD-16, SIMD-8, and SIMD-32 instructions are transformed into SIMD-4 instructions
- Intermix SIMD instructions of various lengths (SIMD-8, SIMD-16, SIMD-4)
[Figure from The Computer Architecture of the Intel Processor Graphics Gen9]

Mapping the BSP Model
- SIMD-16, SIMD-8, and SIMD-32 instructions are transformed into SIMD-4 instructions
[Figure: Grid 1 with Block (0, 0), Block (0, 1), Block (1, 0), and Block (1, 1); Block (1, 1) expanded into its individual thread indices]
- Map multiple threads to a SIMD instance executed by an EU
- All threads in a TB or workgroup mapped to same thread (shared memory access)
[Figure from The Computer Architecture of the Intel Processor Graphics Gen9]

Subslice Organization
- #EUs × #threads/EU determines width of the slice
- Flexible data interface
  - Scatter/gather support
  - Memory request coalescing across 64-byte cache lines
  - Shared memory access
[From The Computer Architecture of the Intel Processor Graphics Gen9]

Slice Organization
- Shared memory
  - 64 KB/slice
  - Not coherent with other structures
- Flexible partitioning
  - SM
  - Data cache
  - Buffers for accelerators
[From The Computer Architecture of the Intel Processor Graphics Gen9]

Product Organization
- Load balancing
  - Honor barrier and shared memory constraints
- Shared virtual memory
  - Share pointer-rich data structures between CPU and GPU
- Coherent shared memory between CPU and GPU
- Implemented shared atomics (with CPU)
[From The Computer Architecture of the Intel Processor Graphics Gen9]

Coherent Memory Hierarchy
[Figure: coherent memory hierarchy, with one portion marked "Not coherent"; from The Computer Architecture of the Intel Processor Graphics Gen9]

Microarchitecture Operation
[Figure: pipeline stages I-Fetch, Decode, I-Buffer, Issue, PRF/RF, SIMD pipelines, D-Cache (All Hit? / Data pending), Writeback]
- Per-thread operation
- Per-thread scoreboard check
- Arbitration and dual issue/2-cycles
- Operand fetch/swizzle
  - Encode swizzle in RF access
- Instruction execution happens in waves of 4-wide operations
  - Note: variable-width SIMD instructions

Divergence Assessment
[Figure: SIMD efficiency of coherent applications and divergent applications]

Basic Cycle Compression
[Figure: RF for a single operand]
- Example: actual operation depends on data types, execution cycles/op
- Note power/energy savings

Swizzle Cycle Compression

SCC Operation
[Figure: RF for a single operand (128b, 4 lanes per cycle), shown for cycle i through cycle i+3]
- Can compact across quads
- Swizzle settings overlapped with RF access
- Pack lanes into a quad
- Note increased area, power/energy
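The difference between basic cycle compression (BCC) and swizzle cycle compression (SCC) can be made concrete by counting execution cycles for a given active mask. The sketch below rests on assumptions that are not spelled out on the slides: one 4-lane quad executes per cycle, BCC skips only quads whose four lanes are all idle, and SCC can pack any four active lanes into one quad. All function names are hypothetical, and the mask length is assumed to be a multiple of four.

```python
QUAD = 4  # assumed: execution proceeds in waves of 4-wide operations


def baseline_cycles(mask):
    """Every quad of the SIMD instruction is issued, idle or not."""
    return len(mask) // QUAD


def bcc_cycles(mask):
    """BCC (assumed behavior): skip quads in which no lane is active."""
    quads = [mask[i:i + QUAD] for i in range(0, len(mask), QUAD)]
    return sum(1 for quad in quads if any(quad))


def scc_cycles(mask):
    """SCC (assumed behavior): swizzle active lanes into fully packed quads."""
    active = sum(1 for lane in mask if lane)
    return -(-active // QUAD)  # ceiling division
```

For a SIMD-16 mask with three active lanes spread over three different quads, the baseline takes four cycles, BCC takes three, and SCC takes one. This is why SCC can help where BCC cannot, at the cost of the extra swizzle hardware noted on the slide.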

Compaction Opportunities
- For K active threads, what is the maximum cycle savings for SIMD-N instructions?
[Figure: SIMD-8 and SIMD-16 examples with idle lanes, each compacted until no further compaction is possible]

Performance Savings
- Difference between saving cycles and saving time
  - When is #cycles ≠ time?
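One way to answer the question on the Compaction Opportunities slide is the bound below. It assumes one 4-lane quad per execution cycle and at least one active thread; the slides note that the actual cycle count also depends on data types and execution cycles per operation, so this is a best-case sketch rather than a hardware guarantee. The symbols C_base, C_min, and S_max are introduced here for illustration only.

```latex
\[
  C_{\text{base}} = \frac{N}{4}, \qquad
  C_{\min} = \left\lceil \frac{K}{4} \right\rceil, \qquad
  S_{\max} = C_{\text{base}} - C_{\min}
           = \frac{N}{4} - \left\lceil \frac{K}{4} \right\rceil,
  \qquad 1 \le K \le N .
\]
```

For example, a SIMD-16 instruction with K = 5 active threads can save at most 4 - 2 = 2 cycles, while a SIMD-8 instruction with the same K saves 2 - 2 = 0 cycles, which corresponds to the "no further compaction possible" case.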

Summary
- Multi-cycle warp/SIMD/work_group execution
- Optimize #cycles/warp by compressing idle cycles
  - Rearrange idle cycles via swizzling to create opportunity
- Sensitivities to the memory interface speeds
  - Memory-bound applications may experience limited benefit

Intra-Warp Compaction
[Figure: control flow graph of basic blocks A through G]
- Scope limited to within a warp
- Increasing scope means increasing warp size, explicitly or implicitly (treating multiple warps as a single warp)