CHERRY: CHECKPOINTED EARLY RESOURCE RECYCLING José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2 1 2 3
MOTIVATION Problem: Limited processor resources Goal: More efficient use by aggressive recycling Opportunity: Resources reserved until retirement Solution: Decouple recycling from retirement : ed Early Resource Recycling in Out-of-order Microprocessors Slide 2/41
CURRENT PROCESSORS Q Entry STQ Entry Register Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41
CURRENT PROCESSORS Q Entry STQ Entry Register Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41
CURRENT PROCESSORS Q Entry Add STQ Entry Register Conventional ROB Tail (newest) Add Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41
CURRENT PROCESSORS Q Entry STQ Entry Add Register Conventional ROB Tail (newest) Add Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41
CURRENT PROCESSORS Q Entry STQ Entry Add ST Register ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41
CURRENT PROCESSORS Q Entry STQ Entry Add ST Register ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41
CURRENT PROCESSORS Q Entry STQ Entry Register Add ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41
CURRENT PROCESSORS Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41
CURRENT PROCESSORS Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41
CURRENT PROCESSORS Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41
CURRENT PROCESSORS Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41
CURRENT PROCESSORS Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41
PROPOSAL: EARLY RECYCLING Decouple resource recycling from instruction retirement Recycle when no longer needed Targeted resources: Load queue entries Store queue entries Physical registers Potential when targeted resources are unlimited: 1.12 speedup in SPECint2000 1.32 speedup in SPECfp2000 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 4/41
EARLY RECYCLING Decouple resource recycling from retirement Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 5/41
EARLY RECYCLING Decouple resource recycling from retirement Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 5/41
EARLY RECYCLING Decouple resource recycling from retirement Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 5/41
EARLY RECYCLING Decouple resource recycling from retirement Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 5/41
CONSTRAINT: PRECISE STATE Events that require precise state: Branch mispredictions Memory replay traps Exceptions Interrupts Early recycling makes this difficult E.g. recycled registers no longer available E.g. memory may be overwriten prematurely : ed Early Resource Recycling in Out-of-order Microprocessors Slide 6/41
PROPOSAL: POINT OF NO RETURN Classify events according to frequency: Frequent: branch mispredictions, memory replays Infrequent: exceptions Split ROB into two logical sets of instructions Irreversible set Subject to infrequent events only recovery Reversible set: Subject to frequent and infrequent events Conventional ROB recovery mechanisms Boundary: Point of No Return (PNR) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 7/41
CONVENTIONAL VS CHERRY ROB Reversible Conventional ROB Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 8/41
CONVENTIONAL VS CHERRY ROB Reversible Conventional ROB Reversible Irreversible Head (oldest) ROB Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 8/41
CONVENTIONAL VS CHERRY ROB Reversible Conventional ROB Reversible Irreversible Head (oldest) ROB Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 8/41
OUTLINE Motivation Overview Implementation ing Exception and interrupt handling Early resource recycling Results : ed Early Resource Recycling in Out-of-order Microprocessors Slide 9/41
TIMELINE Timeline Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 10/41
TIMELINE Timeline Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 10/41
TIMELINE Timeline Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 10/41
TIMELINE Timeline Conventional ROB Exception : ed Early Resource Recycling in Out-of-order Microprocessors Slide 10/41
TIMELINE Rollback Timeline Conventional ROB Exception : ed Early Resource Recycling in Out-of-order Microprocessors Slide 10/41
TIMELINE Rollback Re exec Timeline Conventional ROB Exception Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 10/41
TIMELINE Rollback Re exec Timeline Conventional ROB Exception Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 10/41
ENTERING CHERRY MODE Backup architectural registers Can be done in a handful of cycles Allow PNR to race ahead of ROB head Updates to cache hierarchy set Volatile bit Rollback Re exec Timeline Conventional ROB Exception Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 11/41
PERIODIC CHECKPOINTING Done regularly to bound re-execution overhead Stop early recycling to create checkpoint (collapse) Freeze PNR (stop recycling) Once ROB head catches up with PNR: Clear Volatile bits from cache Backup architectural registers Rollback Re exec Timeline Conventional ROB Exception Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 12/41
PRECISE EXCEPTION HANDLING Exception in Irreversible set: Roll back to checkpoint Temporarily disable recycling (for some time) Re-execute in conventional OOO mode (w/o ) Allow exception to re-occur Reversible Irreversible Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 13/41
PRECISE EXCEPTION HANDLING Exception in Irreversible set: Roll back to checkpoint Temporarily disable recycling (for some time) Re-execute in conventional OOO mode (w/o ) Allow exception to re-occur Reversible Exception Irreversible Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 13/41
PRECISE EXCEPTION HANDLING Exception in Irreversible set: Roll back to checkpoint Temporarily disable recycling (for some time) Re-execute in conventional OOO mode (w/o ) Allow exception to re-occur Reversible Exception Irreversible Restore from Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 13/41
PRECISE EXCEPTION HANDLING Exception in Irreversible set: Roll back to checkpoint Temporarily disable recycling (for some time) Re-execute in conventional OOO mode (w/o ) Allow exception to re-occur Rollback Timeline Conventional ROB Exception : ed Early Resource Recycling in Out-of-order Microprocessors Slide 13/41
PRECISE EXCEPTION HANDLING Exception in Irreversible set: Roll back to checkpoint Temporarily disable recycling (for some time) Re-execute in conventional OOO mode (w/o ) Allow exception to re-occur Rollback Re exec Timeline Conventional ROB Exception Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 13/41
PRECISE EXCEPTION HANDLING Exception in Irreversible set: Roll back to checkpoint Temporarily disable recycling (for some time) Re-execute in conventional OOO mode (w/o ) Allow exception to re-occur Rollback Re exec Timeline Conventional ROB Exception Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 13/41
PRECISE EXCEPTION HANDLING Exception in Reversible set: Disable resource recycling (freeze PNR) Process exception Take new checkpoint; re-enter mode Reversible Irreversible Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 14/41
PRECISE EXCEPTION HANDLING Exception in Reversible set: Disable resource recycling (freeze PNR) Process exception Take new checkpoint; re-enter mode Exception Reversible Irreversible Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 14/41
PRECISE EXCEPTION HANDLING Exception in Reversible set: Disable resource recycling (freeze PNR) Process exception Take new checkpoint; re-enter mode Exception Reversible Point of No Return & Head : ed Early Resource Recycling in Out-of-order Microprocessors Slide 14/41
PRECISE EXCEPTION HANDLING Exception in Reversible set: Disable resource recycling (freeze PNR) Process exception Take new checkpoint; re-enter mode Exception Reversible Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 14/41
INTERRUPT HANDLING Interrupts are asynchronous: delay handling Disable resource recycling (freeze PNR) Process interrupt Take new checkpoint and re-enter mode Timeline : ed Early Resource Recycling in Out-of-order Microprocessors Slide 15/41
INTERRUPT HANDLING Interrupts are asynchronous: delay handling Disable resource recycling (freeze PNR) Process interrupt Take new checkpoint and re-enter mode Timeline Interrupt : ed Early Resource Recycling in Out-of-order Microprocessors Slide 15/41
INTERRUPT HANDLING Interrupts are asynchronous: delay handling Disable resource recycling (freeze PNR) Process interrupt Take new checkpoint and re-enter mode Timeline Interrupt : ed Early Resource Recycling in Out-of-order Microprocessors Slide 15/41
INTERRUPT HANDLING Interrupts are asynchronous: delay handling Disable resource recycling (freeze PNR) Process interrupt Take new checkpoint and re-enter mode Process Interrupt Timeline Interrupt : ed Early Resource Recycling in Out-of-order Microprocessors Slide 15/41
INTERRUPT HANDLING Interrupts are asynchronous: delay handling Disable resource recycling (freeze PNR) Process interrupt Take new checkpoint and re-enter mode Process Interrupt Timeline Interrupt : ed Early Resource Recycling in Out-of-order Microprocessors Slide 15/41
OUTLINE Motivation Overview Implementation ing Exception and interrupt handling Early resource recycling Results : ed Early Resource Recycling in Out-of-order Microprocessors Slide 16/41
DEFINING THE PNRS Each resource defines its own PNR Based on three ROB pointers: U b : oldest unresolved branch U s : oldest store with unresolved address U l : oldest load with unresolved address B S L S B L Tail (newest) Us Ul Ub Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 17/41
EARLY Q ENTRY RECYCLING Us L2 S0 L1 LQQ Head (oldest) L2 L1 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 18/41
EARLY Q ENTRY RECYCLING Us L2 S0 L1 LQQ Head (oldest) L2 L1 ST replay check : ed Early Resource Recycling in Out-of-order Microprocessors Slide 18/41
EARLY Q ENTRY RECYCLING Constraint: Must detect memory replay traps older(u s ): ST- replay trap Not affected by branches PNR Q = U s : ed Early Resource Recycling in Out-of-order Microprocessors Slide 19/41
EARLY REGISTER RECYCLING Constraint: Must not contain potentially useful data older(u s ): Free of ST- replay trap older(u b ): Free of branch misprediction PNR Reg = oldest(u S, U B ) Of course register must be dead (paper) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 20/41
EARLY RECYCLING SUMMARY Resource PNR Value Q entries U S STQ entries oldest(u L, U S, U B ) Registers oldest(u S, U B ) PNR Q PNR STQ PNR Reg B S L S B L Tail (newest) Us Ul Ub Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 21/41
EVALUATION Execution-driven simulation SPEC 2000 applications SGI MIPSPro compiler Compare against: Processor with unlimited resources Three aggressive base processor designs (paper) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 22/41
PROCESSOR MODEL Frequency: 3.2GHz Fetch/issue/commit width: 8/8/12 I. window/rob size: 128/384 Int/FP registers : 192/128 Ld/St units: 2/2 Int/FP/branch units: 7/5/3 Ld/St queue entries: 32/32 MSHRs: 24 Cache L1 L2 Bus & Memory Size: 32KB 512KB RT: 2 cycles 10 cycles Assoc: 4-way 8-way Line size: 64B 128B Ports: 4 1 checkpoint frequency: 5µs Up to 1 taken branch/cycle RAS: 32 entries BTB: 4K entries, 4-way assoc. Branch predictor: Hybrid with speculative update Bimodal size: 8K entries Two-level size: 64K entries FSB frequency: 400MHz FSB width: 128bit Memory: 4-channel Rambus DRAM bandwidth: 6.4GB/s Memory RT: 120ns : ed Early Resource Recycling in Out-of-order Microprocessors Slide 23/41
OVERALL RESULTS 1.36 1.30 Unlimited Speedup 1.24 1.18 1.12 1.06 1.00 bzip2 crafty gcc gzip mcf parser perlbmk vortex vpr Average 1.06 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 24/41
OVERALL RESULTS 1.36 1.30 Unlimited Speedup 1.24 1.18 1.12 1.06 1.06 Speedup 1.00 1.60 1.54 1.48 1.42 1.36 1.30 1.24 1.18 1.12 1.06 1.00 bzip2 crafty gcc gzip mcf parser perlbmk vortex vpr Average applu apsi art equake mesa mgrid swim wupwise Average 1.26 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 24/41
OVERALL RESULTS 1.36 1.30 Unlimited Speedup 1.24 1.18 1.12 1.06 1.06 1.12 Speedup 1.00 1.60 1.54 1.48 1.42 1.36 1.30 1.24 1.18 1.12 1.06 1.00 bzip2 crafty gcc gzip mcf parser perlbmk vortex vpr Average 1.32 1.26 applu apsi art equake mesa mgrid swim wupwise Average : ed Early Resource Recycling in Out-of-order Microprocessors Slide 24/41
PNR ADVANCE PNR advance as a % of ROB use Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 25/41
PNR ADVANCE PNR advance as a % of ROB use Q Entry 77.4% SPECint Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 25/41
PNR ADVANCE PNR advance as a % of ROB use Q Entry 77.4% SPECint STQ Entry 19.7% Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 25/41
PNR ADVANCE PNR advance as a % of ROB use Q Entry 77.4% SPECint STQ Entry 19.7% Register 28.7% Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 25/41
: ed Early Resource Recycling in Out-of-order Microprocessors Slide 25/41 PNR ADVANCE PNR advance as a % of ROB use 78.3% 63.4% 67.7% SPECfp Register 28.7% STQ Entry 19.7% (oldest) Head (newest) Tail Q Entry 77.4% SPECint
SUMMARY Decoupling resource recycling from retirement Split ROB into two logical sets Use of ROB+checkpointing simultaneously Early recycling mechanism: Q, STQ, Registers Integration with speculative multithreading (paper) Speedup over baseline 1.06 for SPECint2000 1.26 for SPECfp2000 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 26/41
QUESTIONS? José F. Martínez 1 Jose Renau 2 Michael C. Huang 3 Milos Prvulovic 2 Josep Torrellas 2 1 2 3 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 27/41
BACKUP SLIDES : ed Early Resource Recycling in Out-of-order Microprocessors Slide 28/41
RECYCLING PNRS Resource PNR Value Q entries (uniprocessor) U S Q entries (multiprocessor) oldest(u L, U S ) STQ entries oldest(u L, U S, U B ) Registers (uniprocessor) oldest(u S, U B ) Registers (multiprocessor) oldest(u L, U S, U B ) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 29/41
EARLY STQ ENTRY RECYCLING Constraint: Must not overwrite potentially useful data older(u l ): Unresolved may need old data older(u s ): No older is replayable older(u b ): Free of branch misprediction PNR STQ = oldest(u L, U S, U B ) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 30/41
CHECKPOINT FREQUENCY T o = c k + p e c e (1) p e = T c T e (2) c e = s T c 2 + (s 1) (3) T o T c = c k T c + ( s 1 ) Tc (4) 2 T e T c = c k T e s 1 2 (5) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 31/41
CHECKPOINT SELECTION 5 4 Te=200µs 400µs 600µs 800µs 1000µs To/Tc (%) 3 2 1 0 0 2 4 6 8 10 Tc (µs) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 32/41
SPECINT2000 PARAMETERS SPECint Apps Used Irreversible Set Collapse ROB (% of Used ROB) Step (%) Reg LQ SQ (Cycles) bzip2 29.9 24.3 55.8 19.5 292.3 crafty 28.8 33.4 97.6 28.6 41.9 gcc 19.1 19.0 82.3 17.8 66.9 gzip 28.5 65.5 81.7 8.5 47.1 mcf 30.1 14.6 37.7 13.8 695.6 parser 30.7 26.1 80.7 21.8 109.2 perlbmk 12.2 24.6 89.9 20.5 23.3 vortex 39.3 26.3 87.1 24.9 64.4 vpr 32.9 25.2 83.6 21.5 165.1 Average 27.9 28.7 77.4 19.7 167.3 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 33/41
SPECFP2000 PARAMETERS SPECfp Apps Used Irreversible Set Collapse ROB (% of Used ROB) Step (%) Reg LQ SQ (Cycles) applu 62.2 61.6 62.4 60.7 411.5 apsi 76.8 82.3 83.1 81.6 921.1 art 88.0 54.3 62.6 29.2 1247.3 equake 41.6 61.6 69.1 57.3 135.2 mesa 29.8 35.1 44.6 34.6 33.7 mgrid 65.1 91.5 93.5 91.3 335.9 swim 59.4 64.8 65.4 64.7 949.1 wupwise 71.9 90.3 71.2 87.9 190.7 Average 61.9 67.7 78.3 63.4 528.1 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 34/41
REGISTER ACCESS TIME 2500 2000 Access Time (ps) 1500 1000 500 0 100 200 300 400 500 600 700 800 900 1000 Number of Registers : ed Early Resource Recycling in Out-of-order Microprocessors Slide 35/41
OTHER CONSIDERATIONS (PAPER) Exception on instruction older than youngest PNR: Roll back to checkpoint Each PNR recycles at its own pace Not all PNRs critical, e.g. PNR Q Early register recycling: Must count in-flight consumers (Moudgill et al 1993) Optimal interval between checkpoints : ed Early Resource Recycling in Out-of-order Microprocessors Slide 36/41
CHERRY PARAMETERS checkpoint frequency: 5µs overhead: 60 cycles (18.75ns) Avg time between exceptions: 800µs (not modeled) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 37/41
INGORING U b Speedup 1.36 1.30 1.24 1.18 1.12 1.06 1.00 1.12 1.09 1.06 bzip2 crafty gcc gzip mcf parser perlbmk vortex vpr Average Unlimited Pass Branches : ed Early Resource Recycling in Out-of-order Microprocessors Slide 38/41
PERFECT ANCH PREDICTION Speedup 1.36 1.30 1.24 1.18 1.12 1.26 1.18 1.12 Unlimited Pass Branches 1.06 1.00 bzip2 crafty gcc gzip mcf parser perlbmk vortex vpr Average : ed Early Resource Recycling in Out-of-order Microprocessors Slide 39/41
INTERRUPT COLLAPSING Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 40/41
INTERRUPT COLLAPSING Interrupt Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 40/41
INTERRUPT COLLAPSING Interrupt Point of No Return & Head : ed Early Resource Recycling in Out-of-order Microprocessors Slide 40/41
EQUIVALENT SYSTEM Resource Baseline SPECint2000 SPECfp2000 Q entries 32 80 128 STQ entries 32 64 112 Int Registers 192 240 240 FP Registers 128 128 192 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 41/41