Understanding GPGPU Vector Register File Usage

Size: px

Start display at page:

Download "Understanding GPGPU Vector Register File Usage"

Jodie Bishop
5 years ago
Views:

1 Understanding GPGPU Vector Register File Usage Mark Wyse AMD Research, Advanced Micro Devices, Inc. Paul G. Allen School of Computer Science & Engineering, University of Washington

2 AGENDA GPU Architecture gem5 Model and Simulation Vector General Purpose Register Usage Operand Buffering and Register File Caching Dispatch Limits and Wave Level Parallelism Conclusion 2 PUBLIC

3 GPU Architecture 3 PUBLIC

4 GPU ARCHITECTURE HIGH LEVEL OVERVIEW Massively multi-threaded compute engines Use parallelism to hide memory latency Programmed with Single Instruction, Multiple Thread (SIMT) model Kernel Threads Workgroup Wavefront Hardware executes instructions using Single Instruction, Multiple Data (SIMD) model Lockstep execution of threads in a Wavefront Huge amount of on-chip context to enable single cycle thread context switching Register files are larger than the data caches 4 PUBLIC

5 EXAMPLE COMPUTE UNIT ARCHITECTURE Compute Unit (CU) responsible for executing workgroups Each CU contains: Wavefront (WF) contexts SIMD Vector ALUs Scalar ALUs Register Files Memory Pipelines L1 Vector Data Cache Local Data Share (LDS) Each CU connects to: Scalar Cache Instruction Cache L2 Data Cache / DRAM WF Context 1 SIMD VALU Vector RF Scalar Cache Compute Unit (CU) Instruction Fetch Dependency Logic Instruction Arbitration & Scheduler Scalar Memory Pipeline WF Context 2 SALU Scalar RF LDS Execution Units Local Memory Pipeline SIMD VALU Vector RF Data Cache Global Memory Pipeline WF Context N SALU Scalar RF I-Cache Figure 1. Sample CU Architecture 5 PUBLIC

6 Operand Buffer Register File Cache VECTOR REGISTER FILE SUBSYSTEM Each VALU has a private Vector Register File (VRF) 512 Vector General Purpose Registers (VGPRs) per VRF 128 KB VGPR storage per VRF/SIMD VALU VGPRs distributed across 4 SRAM banks 1 read, 1 write per cycle per bank Operand Buffer (OB) Instruction FIFO Reads vector operands for VALU instructions Register File Cache (RFC) VGPR cache, strict LRU replacement Stores VALU results Forwarding paths to OB, VALU, Vector Memory units Vector RF Bank 0 Vector RF Bank 1 Vector RF Bank 2 Vector RF Bank 3 Lane 0 Lane 1 Lane 2 SIMD VALU Lane 61 Lane 62 Lane 63 Figure 2. Vector Register File Subsystem Architecture 6 PUBLIC

7 gem5 Model and Simulation 7 PUBLIC

8 GEM5 MODEL AND SIMULATION Execution driven, cycle-level microarchitecture simulator System Call Emulation (SE) mode Faithfully implement CU as described in previous slides, capable of executing AMD GCN3 ISA Unmodified, publically available ROCm-1.1 software stack Emulated kernel driver functionality Wavefront and Instruction Readiness Check every wavefront each cycle to determine if it has a ready instruction Instruction Dispatch and Execution Select one wave per execution resource per cycle from list of ready waves Initiate register reads VALU operations sent to Operand Buffer Single CPU, single CU system 8 PUBLIC

9 BENCHMARKS Benchmark Array-BW Bitonic Sort CoMD FFT HPGMG MD SNAP SpMV XSBench Description Memory streaming Parallel Merge Sort DOE Molecular-dynamics algorithms Digital signal processing Ranks HPC systems Generic Molecular-dynamics algorithms Discrete ordinates neutral particle transport application Sparse matrix-vector multiplication Monte Carlo particle transport simulation Table 1. Benchmarks 9 PUBLIC

10 VGPR Usage 10 PUBLIC

11 VGPR USAGE Jeon et al. claim many registers contain no live value for significant portion of execution Gebhart et al. claim most register values produced are read only once, and the read occurs within a small distance of the value being produced Question: Do these observations hold for our simulated architecture and GCN3 code? Yes! Analysis: Dynamic instrumentation of vector register usage in gem5 simulation to track read after write distance and number of reads per write per value produced 11 PUBLIC

12 Vector Registers w/ Live Values VGPR USAGE Live Vector Register Values Over Time - CoMD Vector Registers w/ Live Values Vector Registers Allocated Dynamic Instructions Executed 12 PUBLIC

13 Vector Registers w/ Live Values VGPR USAGE Live Vector Register Values Over Time - HPGMG Vector Registers w/ Live Values Vector Registers Allocated Dynamic Instructions Executed 13 PUBLIC

14 Percent of All Vector Values Produced VGPR USAGE 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Series1 Series2 Series3 Series Figure 3. Number of reads per vector register value 14 PUBLIC

15 Percent of All Vector Values Consumed VGPR USAGE 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Series1 Series2 Series3 Series Figure 4. Producer to consumer distance 15 PUBLIC

16 Operand Buffering and Register File Caching 16 PUBLIC

17 OB AND RFC EFFECTIVENESS VALU VGPR READS SAVED OVERVIEW Goal: evaluate effectiveness of the Register File Cache at reducing the number of register file reads performed by the Operand Buffer Experiment: sweep RFC size from 2 to 512 entries per SIMD/VRF and examine number of VALU VGPR reads saved and performance impact Conclusion: Register File Cache can reduce between 19% and 41% of VALU VGPR reads at size 8 with stable performance, and greater fraction of reads at larger sizes 17 PUBLIC

18 OB AND RFC EFFECTIVENESS VALU VGPR READS SAVED NUMBER OF VALU VGPR READS SAVED BY RFC FORWARDING 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% VALU VGPR VRF Reads Saved100% RFC-2 RFC-4 RFC-8 RFC-16 RFC-32 RFC-64 RFC-128 RFC-256 RFC PUBLIC Figure 5. Number of VALU VGPR reads saved by RFC forwarding paths

19 OB AND RFC EFFECTIVENESS OVERVIEW Goal: evaluate effectiveness of the Operand Buffer and Register File Cache structures in hiding register file bank conflicts Conclusions: Operand Buffer and Register File Cache are highly effective at hiding bank conflict stalls and latency penalties 19 PUBLIC

20 OB AND RFC EFFECTIVENESS EXPERIMENTAL SETUP Experiment: compare performance of baseline architecture to one that has no register file bank access conflicts Upper Bound Study Conflict Free configuration places each VGPR into its own VRF bank 20 PUBLIC

21 Relative IPC OB AND RFC EFFECTIVENESS PERFORMANCE ESTIMATION 101% 100% Baseline Conflict Free 99% 98% 97% 96% 95% Figure 6. Relative performance of Conflict Free configuration compared to baseline (Note y-axis begins at 95%) 21 PUBLIC

22 Dispatch Limits and Wave Level Parallelism 22 PUBLIC

23 DISPATCH LIMITS AND WLP OVERVIEW Goal: evaluate impact on WLP and performance of providing twice as many VGPRs per SIMD Conclusions: Providing additional VGPRs allows more workgroups/wavefronts to be dispatched per CU, but may not always provide a performance benefit 23 PUBLIC

24 Relative WLP Relative IPC DISPATCH LIMITS AND WLP RESULTS Baseline 2x VGPR Series1 Series2 350% 140% 300% 250% 200% 150% 120% 100% 80% 100% 60% 50% 0% 40% 20% 0% Figure 9. Relative WLP with 2x VGPR per SIMD Figure 8. Relative performance with 2x VGPR per SIMD 24 PUBLIC

25 CDF of Vector Memory Access Latency DISPATCH LIMITS AND WLP HEAD TAIL LATENCY 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% FFT Baseline MD Baseline SpMV Baseline XSBench Baseline FFT 2x VGPR MD 2x VGPR SpMV 2x VGPR XSBench 2x VGPR Vector Memory Access Latency Figure 10. CDF of Vector Memory Access Latency Selected Benchmarks 25 PUBLIC

26 Conclusion 26 PUBLIC

27 CONCLUSION Replicate studies in prior works on vector register usage. Values have short RAW distance and are read only a handful of times Operand Buffers and Register File Caches can hide penalties from register file bank conflicts Increased wave level parallelism (WLP) does not guarantee better performance 27 PUBLIC

FUTURE RESEARCH DIRECTIONS Non-graphics influenced general purpose compute accelerator architectures Architecture Programming model Smarter strategies for

28 FUTURE RESEARCH DIRECTIONS Non-graphics influenced general purpose compute accelerator architectures Architecture Programming model Smarter strategies for handling memory divergence Memory bandwidth provided is adequate for compute bound applications But, it limits memory divergent and irregular applications 28 PUBLIC

30 DISCLAIMER & ATTRIBUTION Disclaimer The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Attribution 2017 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. 30 PUBLIC

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT JOSEPH L. GREATHOUSE, MAYANK DAGA AMD RESEARCH 11/20/2014 THIS TALK IN ONE SLIDE Demonstrate how to save space and time