SIMD Divergence Optimization through Intra-Warp Compaction. Aniruddha Vaidya, Anahita Shayesteh, Dong Hyuk Woo, Roy Saharoy, Mani Azimi. ISCA '13



1 SIMD Divergence Optimization through Intra-Warp Compaction. Aniruddha Vaidya, Anahita Shayesteh, Dong Hyuk Woo, Roy Saharoy, Mani Azimi. ISCA '13

2 Problem
GPUs have wide SIMD lanes (16 lanes per warp in this work).
SIMD control flow divergence occurs on an if/else condition.
Common solution: sequentially execute all the control flow paths for all channels. Both the if and the else portions are executed in turn by all channels, with the appropriate channels turned off in each path.
Recent studies combine threads from different warps that follow the same if/else path.
Problem with that approach: it increases memory divergence (i.e., the number of distinct memory or cache line requests per SIMD instruction).
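The common solution described on this slide can be sketched in a few lines of Python. This is only an illustrative model, not code from the talk: the function name `run_if_else` and the example operations are hypothetical, and a real GPU applies the channel mask in hardware rather than in a loop.

```python
def run_if_else(values, cond, then_op, else_op):
    """Execute an if/else over all SIMD channels by masking, one path at a time."""
    mask = [cond(v) for v in values]  # per-channel branch outcome
    result = list(values)
    # Pass 1: the "if" path; channels that took "else" are turned off.
    for ch, v in enumerate(values):
        if mask[ch]:
            result[ch] = then_op(v)
    # Pass 2: the "else" path; channels that took "if" are turned off.
    for ch, v in enumerate(values):
        if not mask[ch]:
            result[ch] = else_op(v)
    return result, mask


# Hypothetical 16-channel warp: even values take the "if" path.
values = list(range(16))
result, mask = run_if_else(values,
                           lambda v: v % 2 == 0,
                           lambda v: v * 2,
                           lambda v: v + 100)
```

Both passes walk all 16 channels even though each path has only 8 active channels, which is the wasted work the rest of the talk targets.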

3 Observation
The number of hardware execution lanes is typically a fraction of the SIMD instruction width (e.g., the 4-wide SIMD ALU in Intel's Ivy Bridge GPU).
Wide SIMD instructions typically execute over multiple cycles because the hardware is narrower than the instruction.
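As a small worked example of this observation, the sketch below computes how many execution cycles a wide instruction needs on narrower hardware and which channels run in each cycle. The function names are hypothetical, and the in-order grouping of channels is an assumption made for illustration.

```python
def cycles_per_instruction(simd_width, hw_lanes):
    """Cycles needed to push simd_width channels through hw_lanes ALU lanes."""
    return -(-simd_width // hw_lanes)  # ceiling division


def channel_groups(simd_width, hw_lanes):
    """Channels executed in each cycle, assuming they are issued in order."""
    return [list(range(base, min(base + hw_lanes, simd_width)))
            for base in range(0, simd_width, hw_lanes)]


# 16-wide instruction on a 4-wide ALU, the widths used in the slides.
n_cycles = cycles_per_instruction(16, 4)
groups = channel_groups(16, 4)
```

With these widths the instruction occupies the ALU for four cycles, one group of four channels per cycle, regardless of how many channels are actually enabled.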

4 Goal
By exploiting the difference between the logical and physical SIMD width of a GPU pipeline, this work addresses the SIMD control divergence problem with intra-warp compaction.

5 GPU register file
[Figure: register file layout; each warp (Warp 0, Warp 1) holds registers r0 through r(n-1), and each register is 16 lanes wide.]

6 Basic Cycle Compression (BCC)
Example: fused multiply-add (FMA), r3 = r0 * r1 + r2.
[Figure: execution timeline for the FMA with operands r0, r1, and r2.]
Instead, we want to issue the next warp at cycle 5.

7 Basic Cycle Compression (BCC)
[Figure: if/else example.]
In this example, the compressed execution time equals the execution time without the divergence caused by the if/else clause.

8 Unfruitful Cases for BCC
BCC does not help when the turned-off channels in an instruction are not contiguous, or are contiguous but not favorably aligned to the hardware SIMD pipeline width.
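The BCC slides above can be modeled with a short sketch. It assumes a 16-channel instruction executed as four aligned groups of four channels (quads), and that BCC skips a cycle only when every channel in that quad is turned off. The function names and the three masks are hypothetical examples, not data from the talk.

```python
HW_WIDTH = 4  # physical ALU lanes, as in the 4-wide ALU mentioned earlier


def bcc_cycles(enabled, hw_width=HW_WIDTH):
    """Cycles under BCC: one per aligned quad with at least one enabled channel."""
    quads = [enabled[b:b + hw_width] for b in range(0, len(enabled), hw_width)]
    return sum(1 for quad in quads if any(quad))


def ideal_cycles(enabled, hw_width=HW_WIDTH):
    """Lower bound if enabled channels could be packed perfectly."""
    return -(-sum(enabled) // hw_width)  # ceiling division


# Fruitful case: the "if" path uses channels 0-7, the "else" path channels 8-15.
aligned_if = [ch < 8 for ch in range(16)]
aligned_else = [not e for e in aligned_if]

# Unfruitful cases: non-contiguous, and contiguous but misaligned.
scattered = [ch % 2 == 0 for ch in range(16)]
misaligned = [2 <= ch < 10 for ch in range(16)]

for name, mask in [("aligned_if", aligned_if), ("scattered", scattered),
                   ("misaligned", misaligned)]:
    print(name, bcc_cycles(mask), ideal_cycles(mask))
```

In the aligned case the two paths together take as many cycles as a single undiverged 16-wide instruction. The scattered and misaligned masks each enable eight channels as well, yet BCC cannot reach the two-cycle lower bound for them.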

9 Swizzled Cycle Compression (SCC)
The positions of disabled and enabled channels are rearranged.
[Figure: if/else example.]

10 Control Algorithm for Swizzling
Method:
1. Detect the optimal number of cycles for execution.
2. Balance occupancy across lanes.
Optimum cycle count: 8/4 = 2.
[Figure: per-lane totals of enabled channels at each balancing step.]
For the 1st EXE cycle, fill idle lanes (1, 3) from busy lanes.
For the 2nd EXE cycle, fill idle lanes (1, 3) similarly.
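The two steps of the method can be sketched as follows. This is a simplified greedy model written for illustration, not the control logic from the paper: it assumes four hardware lanes, assumes that a channel may only move to another lane within its own quad (so each quad contributes at most one channel per lane), and uses hypothetical names (`optimal_cycles`, `scc_schedule`). The example mask, with every other channel enabled, is also an assumption, chosen because it is consistent with the slide's 8/4 = 2 and its idle lanes 1 and 3.

```python
from math import ceil

HW_WIDTH = 4  # physical ALU lanes (assumed)


def optimal_cycles(enabled, hw_width=HW_WIDTH):
    """Step 1: fewest cycles that can hold all enabled channels."""
    return ceil(sum(enabled) / hw_width)


def scc_schedule(enabled, hw_width=HW_WIDTH):
    """Step 2: balance enabled channels across lanes, one quad at a time.

    Returns one list per cycle; entry i is the channel run on lane i, or None.
    """
    queues = [[] for _ in range(hw_width)]  # channels assigned to each lane
    for base in range(0, len(enabled), hw_width):
        active = [ch for ch in range(base, base + hw_width) if enabled[ch]]
        # Pick the least-loaded lanes; on ties, prefer lanes whose own
        # channel in this quad is enabled, so fewer channels have to move.
        order = sorted(range(hw_width),
                       key=lambda l: (len(queues[l]), not enabled[base + l]))
        lanes = order[:len(active)]
        stay = [ch for ch in active if ch - base in lanes]
        move = [ch for ch in active if ch - base not in lanes]
        free = [l for l in lanes if base + l not in stay]
        for ch in stay:
            queues[ch - base].append(ch)   # channel keeps its home lane
        for ch, lane in zip(move, free):
            queues[lane].append(ch)        # channel is swizzled to an idle lane
    n_cycles = max(len(q) for q in queues)
    return [[q[c] if c < len(q) else None for q in queues]
            for c in range(n_cycles)]


# Hypothetical mask: even channels enabled, so lanes 0 and 2 are busy
# and lanes 1 and 3 are idle before swizzling.
enabled = [ch % 2 == 0 for ch in range(16)]
schedule = scc_schedule(enabled)
```

Because each quad hands its channels to the currently least-loaded lanes, lane loads never differ by more than one, so the schedule length matches the step 1 optimum in this model. The sketch says nothing about the cost of the swizzle network itself.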

11 Simulation Methods
Execution-driven simulation: in-house cycle-level Intel GPGPU simulator; standalone GPU simulation, or a module in parallel CPU+GPU simulation; entire GPU performance simulation with the entire memory hierarchy; 50+ OpenCL benchmark applications evaluated.
Trace-driven simulation: GPU core performance simulation only; ~600 OpenCL, OpenGL, and multimedia workload traces.

12 Results: ALU cycles saved (OpenGL and OpenCL)
[Figure: bar chart of ALU cycles saved (0% to 50%) per benchmark, with series SCC% and BCC%. Benchmarks: BFS, HtS, LavaMD, NW, Part, EV, RT-PR-Conf, RT-PR-AL, RT-PR-BL, RT-PR-WM, RT-AO-AL, RT-AO-BL, RT-AO-WM, LuxMark-sky, LuxMark_sala, luxmark_oclcp, bulletphysics, oclprofv1p0, rightware_mandelbulb, tree_search, LuxMark_hdr, OptSAA, sandra_ocl, ati-eigenval, ati_floydwarshall, glbench_egypt, glbench_pro, FD_IntelFinalists, FD_politicians.]

13 Speedup Results
System performance (OpenCL; RayTracing) depends on data cluster (L3 cache) bandwidth.
[Figure: speedup (0% to 60%) for SCC and BCC with 1 L3$ line / cycle and 2 L3$ lines / cycle bandwidth.]
On average across divergent applications: +12% with 1 L3$ line / cycle bandwidth and +18% with 2 L3$ lines / cycle bandwidth.

14 Conclusion
SIMD control divergence solutions that exploit the multi-cycle execution feature of GPUs.
Intra-warp compaction: basic cycle compression (BCC) and swizzled cycle compression (SCC).

15 Register file organization
Baseline: use pairs of registers.
BCC: fetch only half-width registers.

16 Register file organization: overhead
Operand fetch (16 lanes, 512b) is done in 1 cycle.
This operand is held in a 512b latch.
Each quad (128b) passes through a four-lane swizzler with individual lane enables.


SIMD Divergence Optimization through Intra-Warp Compaction Aniruddha S. Vaidya Intel Corporation Santa Clara, CA, USA asvaidya@gmail.com Roy Saharoy Intel Corporation Santa Clara, CA, USA roy.saharoy@intel.com