Low-Complexity Reorder Buffer Architecture*


1 Low-Complexity Reorder Buffer Architecture*
Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose
Department of Computer Science, State University of New York, Binghamton, NY
Annual ACM International Conference on Supercomputing (ICS '02), June 24, 2002
*Supported in part by DARPA through the PAC-C program and NSF
ICS '02

2 Outline
- ROB complexities
- Motivation for the low-complexity ROB
- Low-complexity ROB design
- Results
- Concluding remarks

3 Pentium III-like Superscalar Datapath
[Figure: datapath block diagram with the labels Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch, IQ, LSQ, instruction issue, function units FU1 to FUm (EX), D-cache, ROB, ARF (architectural register file), and result/status forwarding buses.]

4 ROB Port Requirements for a W-way CPU
- Decode/Dispatch: W write ports to set up entries
- Writeback: W write ports to write results
- Dispatch/Issue: 2W read ports to read the source operands
- Commit: W read ports for instruction commitment

5 Where are the Source Values Coming From?
[Figure: the datapath diagram of slide 3 with three numbered markers (1, 2, 3) on the possible sources of a source operand value.]

6 Where are the Source Values Coming From?
[Figure: stacked bar chart of source operand origin (0% to 100%) per SPEC2K benchmark, with integer, floating-point, and overall averages; 96-entry ROB, 4-way processor.]
Average: forwarding 62%, ARF 32%, ROB 6%

7 How Efficiently are the Ports Used?
- Decode/Dispatch: W write ports to set up entries
- Writeback: W write ports to write results
- Dispatch/Issue: 2W read ports to read the source operands, although only 6% of source operands come from the ROB
- Commit: W read ports for instruction commitment
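
The port counts above can be put side by side with the 6% figure. The following is a minimal sketch, not taken from the slides: the function names and dictionary keys are hypothetical, and W = 4 is chosen only because the simulated machine is 4-way.

```python
def rob_port_counts(width):
    """Port counts for a W-way ROB, one entry per use listed on the slide."""
    return {
        "dispatch_write": width,      # set up new entries
        "writeback_write": width,     # write results
        "source_read": 2 * width,     # read up to two source operands per instruction
        "commit_read": width,         # read results at commit
    }


def source_read_share(width):
    """Fraction of all ROB ports that exist only to read source operands."""
    ports = rob_port_counts(width)
    return ports["source_read"] / sum(ports.values())


ports = rob_port_counts(4)
total = sum(ports.values())                 # 20 ports for W = 4
remaining = total - ports["source_read"]    # 12 ports once source reads are removed
print(f"{source_read_share(4):.0%} of the ports serve the source operand reads")
```

Under this count, two fifths of the ports are dedicated to an access path that supplies only about 6% of the source operands.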

8 Approaches to Reducing ROB Complexity
- Reduce the number of read ports for reading out the source operand values
- More radical (and better): completely eliminate the read ports for reading source operand values

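9 Reducing the Number of Read Ports
[Figure: bar chart of performance drop (%) per benchmark. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise. Integer and floating-point averages are also shown.]
Average IPC drop: 3.5% with 1 read port, 1.0% with 2 read ports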

10 Problems with Retaining Fewer Source Read Ports on the ROB
- Arbitration is needed for the small number of ports
- Additional logic is needed to block the instructions that could not get a port
- A switching network is needed to route the operands to the correct destinations
- Multi-cycle access still remains in the critical path of the Dispatch/Issue logic

11 Our Solution: Elimination of Read Ports
[Figure: the datapath diagram with the numbered operand-source markers 1, 2, and 3.]

12 Our Solution: Elimination of Read Ports
[Figure: same diagram as slide 11 (animation step).]

13 Our Solution: Elimination of Read Ports
[Figure: same diagram with only markers 1 and 3 remaining; marker 2 no longer appears.]

14 Comparison of ROB Bitcells (0.18µ, TSMC)
[Figure: layout of a 32-ported SRAM bitcell next to the layout of a 16-ported SRAM bitcell.]
- Area reduction: 71%
- Shorter bitlines and wordlines

15 Our Solution: Elimination of Read Ports
[Figure: the datapath diagram.]
Area reduction: 45%

16 Eliminating/Reducing the Number of Read Ports: Effects on Power Dissipation
Power is reduced because of:
- shorter bitlines and wordlines
- lower capacitive loading
- fewer decoders
- fewer drivers and sense amps

17 Completely Eliminating the Source Read Ports on the ROB
- The problem: instructions that require a value stored in the ROB cannot issue and will stall
- The solution: forward the value to the waiting instruction at the time the value is committed (LATE FORWARDING)

18 Late Forwarding: Use the Normal Forwarding Buses!
[Figure: the datapath diagram, with the result/status forwarding buses called out.]

19 Late Forwarding: Use the Normal Forwarding Buses!
[Figure: same diagram as slide 18 (animation step).]

20 Optimizing Late Forwarding
- Problem: if Late Forwarding is done for every committed result, additional forwarding buses are needed to avoid degrading performance
- Solution: Selective Late Forwarding (SLF)
- SLF requires an additional bit in the ROB
- That bit is set by the dispatched instructions that require Late Forwarding
- No additional forwarding buses are needed, since SLF traffic is very small
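
A minimal sketch of how the SLF bit could behave, under assumptions that are not in the slides: the bit is kept per ROB entry, a consumer sets it when its producer has already written its result (so the normal forwarding was missed), and slot ids never wrap. The names `RobEntry`, `Rob`, and `late_forward` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    dest_reg: int
    value: Optional[int] = None    # filled in at writeback
    late_forward: bool = False     # the additional SLF bit


class Rob:
    def __init__(self):
        self.entries = {}          # slot id -> RobEntry; dicts keep program order
        self.next_slot = 0         # assumption: slot ids never wrap in this sketch

    def dispatch(self, dest_reg, producer_slots=()):
        """Allocate an entry; mark producers whose value can only come from the ROB."""
        for slot in producer_slots:
            producer = self.entries.get(slot)
            if producer is not None and producer.value is not None:
                producer.late_forward = True   # consumer missed the normal forwarding
        slot = self.next_slot
        self.entries[slot] = RobEntry(dest_reg)
        self.next_slot += 1
        return slot

    def writeback(self, slot, value):
        self.entries[slot].value = value

    def commit(self):
        """Retire the oldest entry if it is done; report whether to rebroadcast it."""
        if not self.entries:
            return None
        slot = next(iter(self.entries))
        entry = self.entries[slot]
        if entry.value is None:
            return None                        # head not finished, nothing commits
        del self.entries[slot]
        return slot, entry.value, entry.late_forward
```

Only commits that return `True` in the last position would occupy a forwarding bus, which is the point of making Late Forwarding selective.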

21 Late Forwarding: Use the Normal Forwarding Buses!
[Figure: the datapath diagram, with the result/status forwarding buses called out.]
Only 3.5% of the traffic is from Selective Late Forwarding

22 Performance Drop of Simplified ROB
[Figure: bar chart of performance drop (%) per SPEC2K benchmark, with integer and floating-point averages; two bars carry explicit labels of 17% and 37%.]
Average IPC drop: 9.6% with no ROB read ports and SLF, 3.5% with 1 read port, 1.0% with 2 read ports

23 IPC Penalty: Source Value Not Accessible within the ROB
[Figure: timeline of the lifetime of a result value. Result generation and forwarding come first; the value then resides within the ROB until late forwarding/commitment, after which it resides within the ARF.]
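
The timeline can be read as a rule for where a consumer finds its operand, depending on when it is dispatched. This is a rough sketch only: the function name, the cycle-number arguments, and the exact boundary conditions are assumptions made for illustration, not details from the talk.

```python
def operand_source(dispatch_cycle, result_cycle, commit_cycle):
    """Return (source, earliest cycle at which the operand is available).

    Assumes a ROB with no source read ports and no retention latches.
    """
    if dispatch_cycle <= result_cycle:
        # Consumer is already waiting when the result is broadcast.
        return "forwarding", result_cycle
    if dispatch_cycle < commit_cycle:
        # Value sits in the ROB, which cannot be read: wait for commit.
        return "late_forwarding", commit_cycle
    # Value has been committed and can be read from the ARF.
    return "arf", dispatch_cycle


source, ready = operand_source(dispatch_cycle=12, result_cycle=10, commit_cycle=20)
stall = ready - 12   # cycles the consumer waits beyond its dispatch
```

The middle case is the source of the IPC penalty: the later the producer commits, the longer a consumer dispatched in that window has to wait.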

24 Improving IPC with No Read Ports
- Cache recently generated values in a set of RETENTION LATCHES (RL)
- Retention latches are small and fast
- Only 8 to 16 latches are needed in the set
- The entire set has 1 or 2 read ports

25 Datapath with the Retention Latches
[Figure: the datapath diagram before the retention latches are added.]

26 Datapath with the Retention Latches
[Figure: the same datapath diagram with a RETENTION LATCHES block added.]

27 The Structure of the Retention Latch Set
- 8 or 16 latches, with status and result value fields
- L-ported CAM field (key = ROB_slot_id)
- Inputs: L ROB slot addresses (L = 1 or 2)
- Outputs: L recently written results (L = 1 or 2 works great)
- W write ports for writing up to W results in parallel
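
In software terms, the structure above behaves like a small associative table keyed by ROB slot id. The sketch below is a behavioral model only, assuming FIFO replacement and ignoring port counts and the status field; the class and method names are hypothetical.

```python
from collections import OrderedDict


class RetentionLatches:
    """Small FIFO-managed table of recent results, keyed by ROB slot id."""

    def __init__(self, size=8):
        self.size = size
        self.latches = OrderedDict()   # ROB slot id -> result value, oldest first

    def write(self, rob_slot, value):
        """Record a newly generated result, evicting the oldest entry if full."""
        if rob_slot in self.latches:
            del self.latches[rob_slot]
        elif len(self.latches) >= self.size:
            self.latches.popitem(last=False)   # FIFO: drop the oldest entry
        self.latches[rob_slot] = value

    def lookup(self, rob_slot):
        """Associative lookup by ROB slot id; None means a miss."""
        return self.latches.get(rob_slot)
```

A miss here corresponds to the case where the operand has to wait for late forwarding or be read from the ARF.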

28 Retention Latch Management Strategies
- FIFO: 42% hit rate with an 8-entry RL, 55% with a 16-entry RL
- LRU: 56% hit rate with an 8-entry RL, 62% with a 16-entry RL
- Random replacement: worse performance than FIFO
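
The difference between the two policies shows up when a value is read again after newer results have arrived. The following toy sketch uses a hand-made trace, not benchmark data, and the `hit_rate` function and trace format are hypothetical; it only illustrates why LRU can retain a value that FIFO evicts.

```python
from collections import OrderedDict


def hit_rate(trace, size, policy="fifo"):
    """Fraction of reads that hit; trace is a list of ("w" or "r", rob_slot) pairs."""
    latches = OrderedDict()            # eviction candidates are taken from the front
    hits = reads = 0
    for op, slot in trace:
        if op == "w":
            if slot not in latches and len(latches) >= size:
                latches.popitem(last=False)
            latches[slot] = True
            latches.move_to_end(slot)
        else:
            reads += 1
            if slot in latches:
                hits += 1
                if policy == "lru":
                    latches.move_to_end(slot)   # a read refreshes the entry under LRU
    return hits / reads if reads else 0.0


trace = [("w", 0), ("w", 1), ("r", 0), ("w", 2), ("r", 0)]
fifo = hit_rate(trace, size=2, policy="fifo")   # slot 0 is evicted by the write of slot 2
lru = hit_rate(trace, size=2, policy="lru")     # the read keeps slot 0; slot 1 is evicted
```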

29 Hit Ratios to Retention Latches
[Figure: bar chart of hit ratios (%) per SPEC2K benchmark, with integer and floating-point averages, for 2-ported retention latch sets.]
Average hit ratio: 8-entry FIFO 42%, 16-entry FIFO 55%, 8-entry LRU 56%, 16-entry LRU 62%

30 Accessing Retention Latch Entries
- The ROB index is used as a unique key to search the retention latches for result values
- Keys must remain unique even in the presence of:
  - Reuse of a ROB slot: not a problem for FIFO; for LRU, simply flush the RL entry at commit time
  - Branch mispredictions
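
The commit-time flush for LRU can be sketched as one extra method on an LRU-managed latch set. This is an illustrative model with hypothetical names (`LruRetentionLatches`, `on_commit`); it shows why a reused ROB slot id cannot return a stale value once the committing entry is invalidated.

```python
from collections import OrderedDict


class LruRetentionLatches:
    def __init__(self, size=8):
        self.size = size
        self.latches = OrderedDict()   # ROB slot id -> value, least recently used first

    def write(self, rob_slot, value):
        if rob_slot not in self.latches and len(self.latches) >= self.size:
            self.latches.popitem(last=False)    # evict the least recently used entry
        self.latches[rob_slot] = value
        self.latches.move_to_end(rob_slot)

    def lookup(self, rob_slot):
        if rob_slot not in self.latches:
            return None
        self.latches.move_to_end(rob_slot)      # a hit makes the entry most recent
        return self.latches[rob_slot]

    def on_commit(self, rob_slot):
        """Flush the entry of a committing instruction so its slot id can be reused."""
        self.latches.pop(rob_slot, None)
```

Without `on_commit`, a frequently read entry could survive under LRU until the same ROB slot is allocated to a later instruction, at which point the key would no longer be unique.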

31 Handling Branch Mispredictions
- Selective RL flushing: only the retention latch entries on the mispredicted path are flushed; uses branch tags; complicated implementation
- Complete RL flushing: all retention latch entries are flushed; very simple implementation; performance drop is only 1.5% compared to selective flushing
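
The two flushing options can be contrasted in a few lines. In this sketch each latch entry carries a monotonically increasing sequence number as a simplified stand-in for the branch tags mentioned above; that substitution, the dictionary layout, and the function names are assumptions made for illustration.

```python
def complete_flush(latches):
    """Drop every retention latch entry on a misprediction."""
    latches.clear()


def selective_flush(latches, branch_seq):
    """Drop only the entries produced by instructions younger than the branch.

    latches maps ROB slot id -> (sequence number, value).
    """
    wrong_path = [slot for slot, (seq, _) in latches.items() if seq > branch_seq]
    for slot in wrong_path:
        del latches[slot]


latches = {3: (10, 0xA), 4: (12, 0xB), 5: (15, 0xC)}
selective_flush(latches, branch_seq=11)   # keeps only slot 3, which is older than the branch
```

Complete flushing needs no per-entry bookkeeping at all, which matches the slide's point that it is much simpler at a small IPC cost.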

32 Misprediction Handling: Performance
[Figure: bar chart of IPC per SPEC2K benchmark, with integer and floating-point averages, for selective flushing and complete flushing.]
Average IPC drop: 1.5%

33 Experimental Setup: the AccuPower (DATE '02)
[Figure: tool flow diagram.]
- Compiled SPEC benchmarks and datapath specs feed a microarchitectural simulator (rooted in SimpleScalar)
- The simulator produces performance stats, plus transition counts and context information
- VLSI layout data yields a SPICE deck; SPICE produces measures of energy per transition
- The energy/power estimator turns the transition counts and the per-transition energies into power/energy stats
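
The last step of the flow amounts to weighting each transition count by its measured energy. The sketch below shows that arithmetic only; it is not AccuPower code, and the event names and all numbers are hypothetical placeholders.

```python
def estimate_energy(transition_counts, energy_per_transition):
    """Total energy in joules: sum over event types of count * energy per transition."""
    return sum(
        count * energy_per_transition[event]
        for event, count in transition_counts.items()
    )


def average_power(energy_joules, cycles, clock_period_s):
    """Average power in watts over the simulated interval."""
    return energy_joules / (cycles * clock_period_s)


# Hypothetical numbers, for illustration only.
counts = {"rob_write": 50, "rob_commit_read": 100}              # from the simulator
energies = {"rob_write": 3.0e-12, "rob_commit_read": 2.0e-12}   # joules, from SPICE
energy = estimate_energy(counts, energies)
power = average_power(energy, cycles=1000, clock_period_s=1e-9)
```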

34 Configuration of the Simulated System
- Machine width: 4-way
- Issue queue: 32 entries
- Reorder buffer: 96 entries
- Load/store queue: 32 entries
Simulated the execution of SPEC2000 benchmarks

35 Assumed Timings
- Timing of the baseline model: D1, rename table lookup for ROB index; D2 and D3, source operand read from the ROB
- Timing of the simplified ROB: D1, rename table lookup for ROB index; D2, associative lookup of the operand from the retention latches using the ROB index as a key (smaller delay: few latches)

36 Experimental Results: Effect on Performance
[Figure: bar chart of performance drop (%) per SPEC2K benchmark, with integer and floating-point averages.]
Average IPC drop (negative values are gains): 8-entry 2-ported FIFO 0.1%, 8-entry 2-ported LRU -1.6%, 16-entry 2-ported FIFO -1.0%, 16-entry 2-ported LRU -2.3%

37 Experimental Results: Effect on Performance
[Figure: bar chart of performance drop (%) per SPEC2K benchmark, with integer and floating-point averages.]
Average IPC drop: 8-entry 2-ported FIFO 3.3%, 8-entry 2-ported LRU 1.7%, 16-entry 2-ported FIFO 2.3%, 16-entry 2-ported LRU 1.0%

38 Experimental Results: Effect on Power
[Figure: bar chart of power savings (%) per SPEC2K benchmark, with integer and floating-point averages.]
Average savings: no ROB ports 30%, 8-entry 2-ported FIFO 23.4%, 8-entry 2-ported LRU 22.2%, 16-entry 2-ported FIFO 21%, 16-entry 2-ported LRU 20.2%

39 Summary of Results
- Significantly reduced ROB complexity and power dissipation: 45% area reduction, 20% to 30% power reduction across SPEC 2000 benchmarks
- Actual IPC improvements: 1.6% to 2.3% gain across SPEC benchmarks
- IPC gains come from the 1-cycle access to the RL (vs. the 2 cycles that would be needed for ROB access)

40 Related Work
- Value-Aging Buffer (Hu & Martonosi, PACS 2000)
- Forwarding Buffer and Clustered Register Cache (Borch et al., HPCA '02)
- Multiple Register Banks (Cruz et al., ISCA '00; Balasubramonian et al., MICRO '01)
See the paper for discussion

41 Conclusions
- Typical source operand location statistics can be successfully exploited to reduce ROB complexity
- Significant reduction in ROB area and power: no ROB ports are needed for reading source operands
- IPC gains are possible because a small, low-ported set of retention latches supplies cached operand values in a single cycle

