Register File Organization


Sudhakar Yalamanchili unless otherwise noted

Objective
- To understand the organization of large register files used in GPUs
- Identify the performance bottlenecks and opportunities for optimization in accessing the register file

Reading
- S. Liu et al., Operand Collector Architecture, US Patent 7,834,881
  - Perspective of a lane
- J. H. Choquette et al., Methods and Apparatus for Source Operand Caching, US Patent 8,639,882
  - Perspective of instruction scheduling
- GPGPU-Sim, GPGPU-Sim_3.x_Manual#Introduction

Register File Access: Recap
[Figure: SM pipeline. I-Fetch, Decode, I-Buffer (pending warps), and Issue feed a single-ported register file (Banks 0-15, 1024 bits wide) behind an Arbiter and Xbar. Operand Collectors (OC) and Dispatch Units (DU) feed the ALUs, L/S, and SFU scalar pipelines, followed by the D-Cache and Writeback.]

The SM Register File
- Main Register File (MRF)
- NVIDIA Fermi: 128KB/SM, 2MB per device
- Throughput-optimized design
  - 32 instructions/warp, up to 96 operands/warp/cycle
- Organization?
[Figure: the MRF connected through an Xbar to the LS and SF units that serve a warp.]

Multi-ported Register Files
[Figure: multi-ported register file organization. Read register numbers 1 and 2 drive multiplexers that select Read data 1 and Read data 2 from Register 0 through Register n; a decoder on the write register number selects which register latches the Register data.]
- Area and delay grow with the number of ports
- Use multiple banks with single read and write ports to emulate multiple ports

Multiple Banks
[Figure: bank organization. Each bank holds registers R0, R1, ..., R63 with 32-bit entries.]
- 1R/1W: a single read port and a single write port per bank
- Each access to a register bank produces the same-named register for every lane
- Multiple banks can be accessed concurrently

Thread Register Allocation
[Figure: fat allocation and thin allocation of threads T1-T4 across Banks 0-3.]
- Operands (registers) of a thread can be mapped across register banks in different ways
  - Thin, fat, and mixed
  - Skewed allocation
- Goal is maximum bandwidth
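The effect of the allocation choice on bandwidth can be illustrated with a toy model. This is a minimal sketch, not the hardware mapping: the three mapping functions are hypothetical, they read "fat" as striping one thread's registers across banks and "thin" as keeping one thread's registers in a single bank, and they work at thread granularity with four banks.

```python
from collections import Counter

NUM_BANKS = 4  # assumption: four banks, as in the slide figure


def fat_bank(thread_id, reg_id):
    """Fat (assumed): a thread's registers are striped across all banks."""
    return reg_id % NUM_BANKS


def thin_bank(thread_id, reg_id):
    """Thin (assumed): all registers of a thread live in one bank."""
    return thread_id % NUM_BANKS


def skewed_bank(thread_id, reg_id):
    """Skewed (assumed): the stripe is rotated by the thread ID."""
    return (thread_id + reg_id) % NUM_BANKS


def cycles_to_read(accesses, bank_of):
    """Each bank has one read port, so the busiest bank sets the cycle count."""
    counts = Counter(bank_of(thread, reg) for thread, reg in accesses)
    return max(counts.values())


# One thread reading three of its own registers.
one_thread = [(0, 0), (0, 1), (0, 2)]
# Four threads each reading their own R0.
four_threads = [(0, 0), (1, 0), (2, 0), (3, 0)]

for name, fn in [("fat", fat_bank), ("thin", thin_bank), ("skewed", skewed_bank)]:
    print(name, cycles_to_read(one_thread, fn), cycles_to_read(four_threads, fn))
```

In this model neither pure scheme wins both access patterns, while the skewed mapping serves both in a single cycle, which is one way to read the "goal is maximum bandwidth" bullet.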

Register File Organization
- 128 Kbyte register file per SM
- 48 warps per SM, 1536 threads/SM
- 21 registers/thread
- Why and when do bank conflicts occur?
- Operand access for a warp may take several cycles → need a way to collect operands!
[Figure: register banks, 1024 bits wide, behind an Arbiter.]

Collecting Operands
[Figure: Operand Sets 0-3, each with slots Op0-Op3, feeding a Result FIFO. Example from the perspective of a single lane.]
- Effectively operates as a cache
- Re-use is determined by the interconnect
  - Operand re-use across instructions with independent mux control
  - No operand re-use across inputs: that needs more wiring
  - Use an Xbar for the most flexible re-use
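The figures on the Register File Organization slide follow from simple arithmetic. A minimal sketch, assuming 32 threads per warp and 32-bit (4-byte) registers:

```python
REGFILE_BYTES = 128 * 1024   # 128 Kbyte register file per SM (from the slide)
WARPS_PER_SM = 48            # from the slide
THREADS_PER_WARP = 32        # assumption: standard warp size
BYTES_PER_REG = 4            # assumption: 32-bit registers

threads_per_sm = WARPS_PER_SM * THREADS_PER_WARP          # 1536
total_regs = REGFILE_BYTES // BYTES_PER_REG               # 32768
regs_per_thread = total_regs // threads_per_sm            # 21 (rounded down)
bank_width_bits = THREADS_PER_WARP * BYTES_PER_REG * 8    # 1024

print(threads_per_sm, total_regs, regs_per_thread, bank_width_bits)
```

Under these assumptions, the 1024-bit bank width is exactly one 32-bit register for each of the 32 lanes of a warp.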

Operand Caching
[Figure: Operand Sets 0-3 with slots Op0-Op3, a Cache Table of register numbers, and a Result FIFO.]
- The cache table records the register number held in each operand slot
  - Set by the dispatch unit
  - Individual vs. common settings
  - Queried by the dispatch unit
- Register writes invalidate entries in the cache table

Instruction Dispatch
[Figure: I-Fetch, Decode, I-Buffer (pending warps), Issue, register file, operand slots Op0-Op3, scalar pipelines, D-Cache, and Writeback, annotated with the dispatch steps listed here.]
- Decode to get register IDs
- Check operand cache (cache table)
- Set crossbar and MRF bank indices
- Update cache table
- Set ALU and OC interconnect
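The cache table behaviour described above (set and queried at dispatch, invalidated on register writes) can be sketched in a few lines. This is an illustrative model only: the class, the method names, and the placement policy (a missing operand goes into the slot matching its operand position) are assumptions, not details taken from the patent.

```python
class OperandCacheTable:
    """Hypothetical cache table: one entry per operand slot."""

    def __init__(self, num_slots=4):
        # Each entry is None or the (warp_id, reg_id) held in that slot.
        self.slots = [None] * num_slots

    def lookup(self, warp_id, reg_id):
        """Return the slot index holding this register, or None on a miss."""
        key = (warp_id, reg_id)
        return self.slots.index(key) if key in self.slots else None

    def fill(self, slot, warp_id, reg_id):
        self.slots[slot] = (warp_id, reg_id)

    def invalidate(self, warp_id, reg_id):
        """A register write makes any cached copy stale."""
        slot = self.lookup(warp_id, reg_id)
        if slot is not None:
            self.slots[slot] = None


def dispatch(table, warp_id, src_regs):
    """Return the source registers that must still be read from the MRF."""
    misses = []
    for position, reg in enumerate(src_regs):
        if table.lookup(warp_id, reg) is None:
            misses.append(reg)
            table.fill(position, warp_id, reg)  # assumed placement policy
    return misses


table = OperandCacheTable()
first = dispatch(table, 0, [1, 2])    # cold table: both operands miss
second = dispatch(table, 0, [1, 3])   # R1 is re-used, R3 misses
table.invalidate(0, 1)                # a writeback to R1 invalidates it
third = dispatch(table, 0, [1])       # R1 must be read again
print(first, second, third)
```

Only the misses would generate MRF bank requests, which is where the saving in register file bandwidth comes from.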

Instruction Perspective
[Figure: Banks 0-15 (1024 bits wide), Arbiter, Xbar, OCs, DUs, and the ALUs, L/S, and SFU.]
- OCs request read operands
- Prioritize writes over reads
- Schedule read requests for maximum bandwidth

The Operand Collector
[Figure: register banks, Arbiter, Xbar, OCs, DUs, and the ALUs, L/S, and SFU.]
- Buffer (warp size) operands at the collector
- Sharing of operands across instructions
- Operates as an operand cache for collecting operands across bank conflicts
- Simplifies scheduling accesses to the MRF

Register File Access: Coherency
[Figure: Banks 0-15 (1024 bits wide), Arbiter, Xbar, OCs, DUs, and the ALUs, L/S, and SFU.]
- What happens when a new value is written back to the Main Register File (MRF)?
  - OC values must be invalidated
- An example OC: one instruction with three operand entries, each holding V, Reg, RDY, WID, and the Operand (128 bytes)
- Source: Sim_3.x_Manual#Register_Access_and_the_Operand_Collector

Pipeline Stage
[Figure: the same datapath, with the OC entry layout (Instruction; V, Reg, RDY, WID, Operand (128 bytes) per source operand).]
- OC is allocated and initialized after decode
- Source operand requests are queued at the arbiter
- Operands/cycle → OC is limited by the interconnect
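The collector entry fields (V, Reg, RDY, WID) and the arbiter queueing can be modelled with a small cycle-level sketch. Everything here is a simplification under stated assumptions: the bank hash `(warp_id + reg) % 16` is hypothetical, the arbiter grants at most one read per bank per cycle in collector order, and writes and the crossbar are not modelled.

```python
from dataclasses import dataclass, field

NUM_BANKS = 16  # Banks 0-15, as on the slide


@dataclass
class OperandSlot:
    valid: bool = False   # V: slot holds a source operand request
    reg: int = -1         # Reg: register number
    ready: bool = False   # RDY: operand has arrived from the MRF
    warp_id: int = -1     # WID: owning warp


@dataclass
class OperandCollector:
    slots: list = field(default_factory=lambda: [OperandSlot() for _ in range(3)])

    def allocate(self, warp_id, src_regs):
        """Initialize the collector after decode with up to three sources."""
        for slot, reg in zip(self.slots, src_regs):
            slot.valid, slot.reg, slot.ready, slot.warp_id = True, reg, False, warp_id

    def all_ready(self):
        return all(s.ready for s in self.slots if s.valid)


def bank_of(warp_id, reg):
    """Hypothetical bank hash; the real mapping is not given in the slides."""
    return (warp_id + reg) % NUM_BANKS


def arbitrate(collectors):
    """One cycle: grant at most one pending read per single-ported bank."""
    busy = set()
    for oc in collectors:
        for s in oc.slots:
            if s.valid and not s.ready:
                bank = bank_of(s.warp_id, s.reg)
                if bank not in busy:
                    busy.add(bank)
                    s.ready = True


def cycles_to_collect(collectors):
    cycles = 0
    while not all(oc.all_ready() for oc in collectors):
        arbitrate(collectors)
        cycles += 1
    return cycles


conflict = OperandCollector()
conflict.allocate(0, [0, 16, 1])      # R0 and R16 map to the same bank
no_conflict = OperandCollector()
no_conflict.allocate(0, [0, 1, 2])    # three different banks
print(cycles_to_collect([conflict]), cycles_to_collect([no_conflict]))
```

The instruction with a bank conflict needs a second cycle before all RDY bits are set, which is the buffering role the collector plays.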

Functional Unit Level Collection
[Figure: Banks 0-15 (1024 bits wide), Arbiter, Xbar, OCs, DUs, and the ALUs, L/S, and SFU.]
- Operand collectors are associated with different functional unit types
  - Can naturally support heterogeneity
- Dedicated vs. shared OC units
  - Connectivity consequences
- Other sources of operands
  - Constant cache, read-only cache

Summary
- Register file management and operand dispatch have multiple interacting components
- Performance vs. complexity tradeoff
  - Increasing concurrency requires increasing interconnect complexity
  - Stalls/conflicts require buffering and bypass to increase utilization of the execution units
- Good register file allocation is critical

P. Xiang, Y. Yang, and H. Zhou, Warp Level Divergence: Characterization, Impact, and Mitigation, HPCA 2014
Sudhakar Yalamanchili unless otherwise noted

Objectives
- Understand resource fragmentation in streaming multiprocessors
- Understand challenges and potential solutions of mitigation techniques

Keeping Track of Resources
[Figure: GPU block diagram. The Host CPU connects over the Interconnection Bus to the HW Work Queues and Pending Kernels, the Kernel Management Unit, the Kernel Distributor (entries: PC, Dim, Param, ExeBL), and the SMX Scheduler with its control registers. Each SMX contains cores, warp schedulers, warp contexts, registers, and L1 Cache / Shared Memory. The SMXs share the L2 Cache, Memory Controller, and DRAM.]
- Thread Block Control Registers (TBCR) per SMX
  - KDEI: KDE index
  - BLKID: scheduled TB ID (in execution)
- SMX Scheduler Control Registers (SSCR)
  - KDEI: KDE index
  - NextBL: next TB to be scheduled
- What resources do we need to launch a TB?
- When can we launch a TB?

The Fragmentation Problem
[Figure: the same GPU block diagram, highlighting one TB in which the last warp is still running while the other warp contexts are completed (idle).]
- Registers allocated to completed warps in the TB: temporal underutilization
- Available registers: spatial underutilization
- Goal: How can we improve utilization?


Key Issues
- TB resources are not released until the last warp has completed execution
  - Idle registers
  - Idle shared memory segments
  - Idle warp contexts
- What are the limiting factors?
  - Shared memory size
  - # registers/thread
  - # threads/SM (equivalent to # warps/SM)
- What can be done to increase utilization?

Register Utilization
[Figure from P. Xiang et al., Warp Level Divergence: Characterization, Impact, and Mitigation.]
- Limited by the number of TBs, not registers
- Idle registers are caused by saturation on other resources
- The large percentage of idle registers can be exploited by
  - Prefetchers?
  - Power gating → reduction in static power
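The limiting factors listed above combine as a minimum over per-resource limits. A minimal sketch follows; the SM parameters and the example TB shape are hypothetical values chosen for illustration, not measurements from the paper.

```python
def max_resident_tbs(threads_per_tb, regs_per_thread, smem_per_tb, sm):
    """Return (TBs that fit on one SM, name of the limiting resource)."""
    limits = {
        "threads": sm["max_threads"] // threads_per_tb,
        "registers": sm["num_regs"] // (threads_per_tb * regs_per_thread),
        "tb_slots": sm["max_tbs"],
    }
    if smem_per_tb > 0:
        limits["shared_memory"] = sm["smem_bytes"] // smem_per_tb
    limiter = min(limits, key=limits.get)
    return limits[limiter], limiter


# Hypothetical SM configuration (assumed values).
SM = {"max_threads": 1536, "num_regs": 32768, "smem_bytes": 48 * 1024, "max_tbs": 8}

tbs, limiter = max_resident_tbs(threads_per_tb=256, regs_per_thread=16,
                                smem_per_tb=0, sm=SM)
idle_regs = SM["num_regs"] - tbs * 256 * 16
print(tbs, limiter, idle_regs)
```

In this example the thread limit saturates first, so a quarter of the register file sits idle even though no TB is short of registers, which is the situation the Register Utilization slide describes.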

Execution Imbalance
[Figure from P. Xiang et al., Warp Level Divergence: Characterization, Impact, and Mitigation: completed warps within a TB.]
- Causes
  - Input-dependent workload imbalance
  - Program-dependent workload imbalance
  - Memory divergence
  - Warp scheduling policy

Partial TB Dispatch
[Figure: GPU block diagram (Host CPU, Interconnection Bus, HW Work Queues, Pending Kernels, Kernel Management Unit, Kernel Distributor with entries PC, Dim, Param, ExeBL, SMX Scheduler, SMXs with warp schedulers, warp contexts, registers, and L1 Cache / Shared Memory, L2 Cache, Memory Controller, DRAM), with the TBCR fields KDEI (KDE index) and BLKID (scheduled TB ID, in execution).]
- Dispatch some warps from the next TB
- Check TB-level SM resources
  - Need an entry in the Thread Block Control Registers (TBCR) per SMX
  - Need sufficient registers
  - Need TB-level storage
- Other checks are for warp-level resources
- Need support for partial dispatch
  - Tracking of dispatched warps for a partial TB (warp information)
  - Tracking issued vs. non-issued warps

Partial TB Dispatch (2)
[Figure: GPU block diagram (Host CPU, Interconnection Bus, HW Work Queues, Pending Kernels, Kernel Management Unit, Kernel Distributor with entries PC, Dim, Param, ExeBL, SMX Scheduler, SMXs, L2 Cache, Memory Controller, DRAM), extended with a Workload Buffer.]
- Always dispatch warps from the partial TB first
- Only 1 partial TB at a time
- Thread Block Control Registers (TBCR) per SMX: need an entry
  - KDEI (KDE index), BLKID
- Workload Buffer fields: TBID, Start_warpID, End_warpID, Valid

Summary
- Software tuning of TB size is possible
  - Adversely affects [...] and intra-TB sharing
- Power savings advantages
  - Static power, due to reduced execution time
  - Workload remains roughly constant → possible increase in contention at shared resources
- Hardware support required is relatively modest
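The workload buffer bookkeeping for a partial TB can be sketched as follows. The field names come from the slide, but their semantics here are assumptions: `start_warp_id` is taken to be the next warp not yet dispatched, `end_warp_id` the last warp of the TB, and `valid` a flag that clears once every warp has been dispatched.

```python
from dataclasses import dataclass


@dataclass
class WorkloadBufferEntry:
    kdei: int            # KDE index of the kernel
    blkid: int           # thread block ID
    tbid: int            # TB slot on the SMX
    start_warp_id: int   # assumed: next warp still waiting for dispatch
    end_warp_id: int     # assumed: last warp of the TB
    valid: bool = True   # assumed: a partial TB is outstanding


def dispatch_partial(entry, free_warp_contexts):
    """Dispatch as many remaining warps as fit; return their warp IDs."""
    if not entry.valid:
        return []
    remaining = entry.end_warp_id - entry.start_warp_id + 1
    count = min(remaining, free_warp_contexts)
    dispatched = list(range(entry.start_warp_id, entry.start_warp_id + count))
    entry.start_warp_id += count
    if entry.start_warp_id > entry.end_warp_id:
        entry.valid = False   # TB fully dispatched; a new partial TB may start
    return dispatched


entry = WorkloadBufferEntry(kdei=0, blkid=7, tbid=2, start_warp_id=0, end_warp_id=7)
first = dispatch_partial(entry, free_warp_contexts=3)    # three contexts freed
second = dispatch_partial(entry, free_warp_contexts=10)  # the rest of the TB fits
print(first, second, entry.valid)
```

Because only one entry is valid at a time, the scheduler can always serve the partial TB first before considering a new one.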

On Demand Register Allocation and De-Allocation for a Multithreaded Processor
D. Tarjan and K. Skadron
Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)

Register File Fragmentation
[Figure: GPU block diagram (Host CPU, Interconnection Bus, HW Work Queues, Pending Kernels, Kernel Management Unit, Kernel Distributor with entries PC, Dim, Param, ExeBL, SMX Scheduler, SMXs with warp schedulers, warp contexts, registers, and L1 Cache / Shared Memory, L2 Cache, Memory Controller, DRAM), highlighting the last running warp of a TB.]
- Registers allocated to completed warps in the TB: temporal underutilization
- Goal: How can we improve utilization?

Goal
[Figure: register banks (1024 bits wide) with Arbiter and Xbar feeding the ALUs, L/S, and SFU.]
- Register allocation and de-allocation to increase utilization and/or lower power
- Dynamic, cycle-level allocation and de-allocation via register remapping (renaming)
  - Increase performance for a fixed amount of register storage
  - Decrease register storage for fixed performance

Approach
- For each thread
  - Allocate on write rather than on thread creation
  - Release ASAP
  - → just-in-time (JIT) hardware register allocator for multithreaded processors
- Need a spilling mechanism to avoid deadlock
  - Remember, this is a JIT allocator!
- Basic idea → register renaming

Overview
[Figure: Decode, Rename, and Issue stages. An allocation check (true/false) looks up the virtual register ID in the rename map; the free list supplies a physical register ID.]
- Cycle-level dynamic allocation/de-allocation
- Recycling register IDs
  - Maintaining performance and correctness

Basic Steps
- Allocation
- De-allocation
  - In-order issue
  - Out-of-order issue
- Register footprints and MRF size
  - Size the MRF for the maximum footprint
    - Drowsy and power-gated register cell modes
  - Dynamic spilling
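The rename map and free list in the overview can be sketched as a small allocator that assigns a physical register on first write and recycles it on release. This is a minimal sketch with hypothetical names; it models neither when the hardware decides a register is dead nor the spilling mechanism, and simply raises an error where a spill would be needed.

```python
class JitRegisterAllocator:
    """Hypothetical allocate-on-write register renamer."""

    def __init__(self, num_physical):
        self.free_list = list(range(num_physical))
        self.rename_map = {}  # (thread_id, virtual_reg) -> physical_reg

    def write(self, thread_id, vreg):
        """Allocate on first write; later writes re-use the mapping."""
        key = (thread_id, vreg)
        if key not in self.rename_map:
            if not self.free_list:
                # A real design would spill here to avoid deadlock.
                raise RuntimeError("no free physical register: spill needed")
            self.rename_map[key] = self.free_list.pop(0)
        return self.rename_map[key]

    def read(self, thread_id, vreg):
        return self.rename_map[(thread_id, vreg)]

    def release(self, thread_id, vreg):
        """Return the physical register to the free list as soon as it is dead."""
        self.free_list.append(self.rename_map.pop((thread_id, vreg)))


alloc = JitRegisterAllocator(num_physical=2)
p_a = alloc.write(0, 5)     # thread 0 writes virtual R5
p_b = alloc.write(1, 5)     # thread 1 writes its own virtual R5
alloc.release(1, 5)         # thread 1 no longer needs R5
p_c = alloc.write(0, 6)     # the freed physical register is recycled
print(p_a, p_b, p_c)
```

Two physical registers serve three virtual registers here because one of them was released before the third was written, which is the utilization gain the approach is after.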

Dynamic Spilling
[Figure: strategies for spilling. Spill to memory: the VID is treated as an offset from the MRF base address register into a register spill area, reached through the L1 D-Cache. Spill to local storage: a secondary register file or cache.]
- Spill to memory
  - To a register spill area
  - Treat the VID as an offset into the spill area
- Spill to local storage
  - Register or cache?

An Experiment
- Create TBs of 256 threads and 32 registers/thread
  - Each TB requires 8K registers
- GTX480: 32K registers
- Expected occupancy for each SM is 4 TBs, but it is actually greater! Why?
- From P. Xiang et al., Warp Level Divergence: Characterization, Impact, and Mitigation

Summary
- Net effect is to improve register file utilization
- Note that registers are recycled at a finer granularity than TB boundaries
- You can run experiments on NVIDIA parts to see these effects (speculatively assume something like this is happening).


Intra-Warp Compaction Techniques
Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)

Goal
[Figure: active and idle threads in a warp, before and after compaction.]
- Compact threads in a warp to coalesce (and eliminate)