José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2

Similar documents
Cherry: Checkpointed Early Resource Recycling in Out-of-order Microprocessors Λ

CAVA: Using Checkpoint-Assisted Value Prediction to Hide L2 Misses

CAVA: Hiding L2 Misses with Checkpoint-Assisted Value Prediction

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University

Are we ready for high-mlp?

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N.

Impact of Cache Coherence Protocols on the Processing of Network Traffic

Design of Experiments - Terminology

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Execution-based Prediction Using Speculative Slices

Precise Instruction Scheduling

Exploring Wakeup-Free Instruction Scheduling

Low-Complexity Reorder Buffer Architecture*

POSH: A TLS Compiler that Exploits Program Structure

Narrow Width Dynamic Scheduling

Are We Ready for High Memory-Level Parallelism?

Many Cores, One Thread: Dean Tullsen University of California, San Diego

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

Dynamic Memory Dependence Predication

Deconstructing Commit

A Complexity-Effective Out-of-Order Retirement Microarchitecture

Slackened Memory Dependence Enforcement: Combining Opportunistic Forwarding with Decoupled Verification

Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota

Reducing Latencies of Pipelined Cache Accesses Through Set Prediction

Thread-Level Speculation on a CMP Can Be Energy Efficient

Thread-Level Speculation on a CMP Can Be Energy Efficient

The Validation Buffer Out-of-Order Retirement Microarchitecture

Address-Indexed Memory Disambiguation and Store-to-Load Forwarding

Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors

Implementation Issues of Slackened Memory Dependence Enforcement

ECE404 Term Project Sentinel Thread

Kilo-instructions Processors

Speculative Multithreaded Processors

Microarchitecture Overview. Performance

FINE-GRAIN STATE PROCESSORS PENG ZHOU A DISSERTATION. Submitted in partial fulfillment of the requirements. for the degree of DOCTOR OF PHILOSOPHY

Fast Branch Misprediction Recovery in Out-of-order Superscalar Processors

Kilo-instruction Processors, Runahead and Prefetching

CS A Large, Fast Instruction Window for Tolerating. Cache Misses 1. Tong Li Jinson Koppanalil Alvin R. Lebeck. Department of Computer Science

Fast branch misprediction recovery in out-oforder superscalar processors

Dynamic Scheduling with Narrow Operand Values

Understanding The Effects of Wrong-path Memory References on Processor Performance

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

Administrivia. CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) Control Dependencies

Reducing Reorder Buffer Complexity Through Selective Operand Caching

Microarchitecture Overview. Performance

High Performance Memory Requests Scheduling Technique for Multicore Processors

Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology)

Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching

Software-assisted Cache Mechanisms for Embedded Systems. Prabhat Jain

The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA

L1 Data Cache Decomposition for Energy Efficiency

Scalable Cache Miss Handling for High Memory-Level Parallelism

Dynamically Controlled Resource Allocation in SMT Processors

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution

Computer Science 146. Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture

Performance Oriented Prefetching Enhancements Using Commit Stalls

Using a Serial Cache for. Energy Efficient Instruction Fetching

Paceline: Improving Single-Thread Performance in Nanoscale CMPs through Core Overclocking

Speculative Parallelization in Decoupled Look-ahead

Inserting Data Prefetches into Loops in Dynamically Translated Code in IA-32EL. Inserting Prefetches IA-32 Execution Layer - 1

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors

Speculative Parallelization in Decoupled Look-ahead

HANDLING MEMORY OPS. Dynamically Scheduling Memory Ops. Loads and Stores. Loads and Stores. Loads and Stores. Memory Forwarding

Boost Sequential Program Performance Using A Virtual Large. Instruction Window on Chip Multicore Processor

Analysis of the TRIPS Prototype Block Predictor

Implicitly-Multithreaded Processors

Koji Inoue Department of Informatics, Kyushu University Japan Science and Technology Agency

Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications. University of Illinois

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero

Implicitly-Multithreaded Processors

Exploiting Coarse-Grain Verification Parallelism for Power-Efficient Fault Tolerance

The Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights

18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II. Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012

Use-Based Register Caching with Decoupled Indexing

Multithreaded Value Prediction

Itanium 2 Processor Microarchitecture Overview

Core Fusion: Accommodating Software Diversity in Chip Multiprocessors

Memory Ordering: A Value-Based Approach

BOLT: Energy-Efficient Out-of-Order Latency-Tolerant Execution

An Architectural Characterization Study of Data Mining and Bioinformatics Workloads

Instruction Recirculation: Eliminating Counting Logic in Wakeup-Free Schedulers

PowerPC 620 Case Study

Dynamic Speculative Precomputation

Probabilistic Replacement: Enabling Flexible Use of Shared Caches for CMPs

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University

Power-Efficient Approaches to Reliability. Abstract

MEMORY ORDERING: A VALUE-BASED APPROACH

A Cross-Architectural Interface for Code Cache Manipulation. Kim Hazelwood and Robert Cohn

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers

Cluster Assignment Strategies for a Clustered Trace Cache Processor

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX

Accelerating and Adapting Precomputation Threads for Efficient Prefetching

Mesocode: Optimizations for Improving Fetch Bandwidth of Future Itanium Processors

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas

Speculative Multithreaded Processors

An Optimized Front-End Physical Register File with Banking and Writeback Filtering

The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation

Transcription:

CHERRY: CHECKPOINTED EARLY RESOURCE RECYCLING José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2 1 2 3

MOTIVATION Problem: Limited processor resources Goal: More efficient use by aggressive recycling Opportunity: Resources reserved until retirement Solution: Decouple recycling from retirement : ed Early Resource Recycling in Out-of-order Microprocessors Slide 2/41

CURRENT PROCESSORS Q Entry STQ Entry Register Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

CURRENT PROCESSORS Q Entry STQ Entry Register Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

CURRENT PROCESSORS Q Entry Add STQ Entry Register Conventional ROB Tail (newest) Add Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

CURRENT PROCESSORS Q Entry STQ Entry Add Register Conventional ROB Tail (newest) Add Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

CURRENT PROCESSORS Q Entry STQ Entry Add ST Register ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

CURRENT PROCESSORS Q Entry STQ Entry Add ST Register ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

CURRENT PROCESSORS Q Entry STQ Entry Register Add ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

CURRENT PROCESSORS Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

CURRENT PROCESSORS Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

CURRENT PROCESSORS Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

CURRENT PROCESSORS Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

CURRENT PROCESSORS Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 3/41

PROPOSAL: EARLY RECYCLING Decouple resource recycling from instruction retirement Recycle when no longer needed Targeted resources: Load queue entries Store queue entries Physical registers Potential when targeted resources are unlimited: 1.12 speedup in SPECint2000 1.32 speedup in SPECfp2000 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 4/41

EARLY RECYCLING Decouple resource recycling from retirement Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 5/41

EARLY RECYCLING Decouple resource recycling from retirement Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 5/41

EARLY RECYCLING Decouple resource recycling from retirement Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 5/41

EARLY RECYCLING Decouple resource recycling from retirement Q Entry STQ Entry Register Add ST ST ST ST Add Conventional ROB Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 5/41

CONSTRAINT: PRECISE STATE Events that require precise state: Branch mispredictions Memory replay traps Exceptions Interrupts Early recycling makes this difficult E.g. recycled registers no longer available E.g. memory may be overwriten prematurely : ed Early Resource Recycling in Out-of-order Microprocessors Slide 6/41

PROPOSAL: POINT OF NO RETURN Classify events according to frequency: Frequent: branch mispredictions, memory replays Infrequent: exceptions Split ROB into two logical sets of instructions Irreversible set Subject to infrequent events only recovery Reversible set: Subject to frequent and infrequent events Conventional ROB recovery mechanisms Boundary: Point of No Return (PNR) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 7/41

CONVENTIONAL VS CHERRY ROB Reversible Conventional ROB Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 8/41

CONVENTIONAL VS CHERRY ROB Reversible Conventional ROB Reversible Irreversible Head (oldest) ROB Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 8/41

CONVENTIONAL VS CHERRY ROB Reversible Conventional ROB Reversible Irreversible Head (oldest) ROB Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 8/41

OUTLINE Motivation Overview Implementation ing Exception and interrupt handling Early resource recycling Results : ed Early Resource Recycling in Out-of-order Microprocessors Slide 9/41

TIMELINE Timeline Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 10/41

TIMELINE Timeline Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 10/41

TIMELINE Timeline Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 10/41

TIMELINE Timeline Conventional ROB Exception : ed Early Resource Recycling in Out-of-order Microprocessors Slide 10/41

TIMELINE Rollback Timeline Conventional ROB Exception : ed Early Resource Recycling in Out-of-order Microprocessors Slide 10/41

TIMELINE Rollback Re exec Timeline Conventional ROB Exception Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 10/41

TIMELINE Rollback Re exec Timeline Conventional ROB Exception Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 10/41

ENTERING CHERRY MODE Backup architectural registers Can be done in a handful of cycles Allow PNR to race ahead of ROB head Updates to cache hierarchy set Volatile bit Rollback Re exec Timeline Conventional ROB Exception Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 11/41

PERIODIC CHECKPOINTING Done regularly to bound re-execution overhead Stop early recycling to create checkpoint (collapse) Freeze PNR (stop recycling) Once ROB head catches up with PNR: Clear Volatile bits from cache Backup architectural registers Rollback Re exec Timeline Conventional ROB Exception Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 12/41

PRECISE EXCEPTION HANDLING Exception in Irreversible set: Roll back to checkpoint Temporarily disable recycling (for some time) Re-execute in conventional OOO mode (w/o ) Allow exception to re-occur Reversible Irreversible Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 13/41

PRECISE EXCEPTION HANDLING Exception in Irreversible set: Roll back to checkpoint Temporarily disable recycling (for some time) Re-execute in conventional OOO mode (w/o ) Allow exception to re-occur Reversible Exception Irreversible Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 13/41

PRECISE EXCEPTION HANDLING Exception in Irreversible set: Roll back to checkpoint Temporarily disable recycling (for some time) Re-execute in conventional OOO mode (w/o ) Allow exception to re-occur Reversible Exception Irreversible Restore from Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 13/41

PRECISE EXCEPTION HANDLING Exception in Irreversible set: Roll back to checkpoint Temporarily disable recycling (for some time) Re-execute in conventional OOO mode (w/o ) Allow exception to re-occur Rollback Timeline Conventional ROB Exception : ed Early Resource Recycling in Out-of-order Microprocessors Slide 13/41

PRECISE EXCEPTION HANDLING Exception in Irreversible set: Roll back to checkpoint Temporarily disable recycling (for some time) Re-execute in conventional OOO mode (w/o ) Allow exception to re-occur Rollback Re exec Timeline Conventional ROB Exception Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 13/41

PRECISE EXCEPTION HANDLING Exception in Irreversible set: Roll back to checkpoint Temporarily disable recycling (for some time) Re-execute in conventional OOO mode (w/o ) Allow exception to re-occur Rollback Re exec Timeline Conventional ROB Exception Conventional ROB : ed Early Resource Recycling in Out-of-order Microprocessors Slide 13/41

PRECISE EXCEPTION HANDLING Exception in Reversible set: Disable resource recycling (freeze PNR) Process exception Take new checkpoint; re-enter mode Reversible Irreversible Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 14/41

PRECISE EXCEPTION HANDLING Exception in Reversible set: Disable resource recycling (freeze PNR) Process exception Take new checkpoint; re-enter mode Exception Reversible Irreversible Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 14/41

PRECISE EXCEPTION HANDLING Exception in Reversible set: Disable resource recycling (freeze PNR) Process exception Take new checkpoint; re-enter mode Exception Reversible Point of No Return & Head : ed Early Resource Recycling in Out-of-order Microprocessors Slide 14/41

PRECISE EXCEPTION HANDLING Exception in Reversible set: Disable resource recycling (freeze PNR) Process exception Take new checkpoint; re-enter mode Exception Reversible Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 14/41

INTERRUPT HANDLING Interrupts are asynchronous: delay handling Disable resource recycling (freeze PNR) Process interrupt Take new checkpoint and re-enter mode Timeline : ed Early Resource Recycling in Out-of-order Microprocessors Slide 15/41

INTERRUPT HANDLING Interrupts are asynchronous: delay handling Disable resource recycling (freeze PNR) Process interrupt Take new checkpoint and re-enter mode Timeline Interrupt : ed Early Resource Recycling in Out-of-order Microprocessors Slide 15/41

INTERRUPT HANDLING Interrupts are asynchronous: delay handling Disable resource recycling (freeze PNR) Process interrupt Take new checkpoint and re-enter mode Timeline Interrupt : ed Early Resource Recycling in Out-of-order Microprocessors Slide 15/41

INTERRUPT HANDLING Interrupts are asynchronous: delay handling Disable resource recycling (freeze PNR) Process interrupt Take new checkpoint and re-enter mode Process Interrupt Timeline Interrupt : ed Early Resource Recycling in Out-of-order Microprocessors Slide 15/41

INTERRUPT HANDLING Interrupts are asynchronous: delay handling Disable resource recycling (freeze PNR) Process interrupt Take new checkpoint and re-enter mode Process Interrupt Timeline Interrupt : ed Early Resource Recycling in Out-of-order Microprocessors Slide 15/41

OUTLINE Motivation Overview Implementation ing Exception and interrupt handling Early resource recycling Results : ed Early Resource Recycling in Out-of-order Microprocessors Slide 16/41

DEFINING THE PNRS Each resource defines its own PNR Based on three ROB pointers: U b : oldest unresolved branch U s : oldest store with unresolved address U l : oldest load with unresolved address B S L S B L Tail (newest) Us Ul Ub Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 17/41

EARLY Q ENTRY RECYCLING Us L2 S0 L1 LQQ Head (oldest) L2 L1 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 18/41

EARLY Q ENTRY RECYCLING Us L2 S0 L1 LQQ Head (oldest) L2 L1 ST replay check : ed Early Resource Recycling in Out-of-order Microprocessors Slide 18/41

EARLY Q ENTRY RECYCLING Constraint: Must detect memory replay traps older(u s ): ST- replay trap Not affected by branches PNR Q = U s : ed Early Resource Recycling in Out-of-order Microprocessors Slide 19/41

EARLY REGISTER RECYCLING Constraint: Must not contain potentially useful data older(u s ): Free of ST- replay trap older(u b ): Free of branch misprediction PNR Reg = oldest(u S, U B ) Of course register must be dead (paper) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 20/41

EARLY RECYCLING SUMMARY Resource PNR Value Q entries U S STQ entries oldest(u L, U S, U B ) Registers oldest(u S, U B ) PNR Q PNR STQ PNR Reg B S L S B L Tail (newest) Us Ul Ub Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 21/41

EVALUATION Execution-driven simulation SPEC 2000 applications SGI MIPSPro compiler Compare against: Processor with unlimited resources Three aggressive base processor designs (paper) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 22/41

PROCESSOR MODEL Frequency: 3.2GHz Fetch/issue/commit width: 8/8/12 I. window/rob size: 128/384 Int/FP registers : 192/128 Ld/St units: 2/2 Int/FP/branch units: 7/5/3 Ld/St queue entries: 32/32 MSHRs: 24 Cache L1 L2 Bus & Memory Size: 32KB 512KB RT: 2 cycles 10 cycles Assoc: 4-way 8-way Line size: 64B 128B Ports: 4 1 checkpoint frequency: 5µs Up to 1 taken branch/cycle RAS: 32 entries BTB: 4K entries, 4-way assoc. Branch predictor: Hybrid with speculative update Bimodal size: 8K entries Two-level size: 64K entries FSB frequency: 400MHz FSB width: 128bit Memory: 4-channel Rambus DRAM bandwidth: 6.4GB/s Memory RT: 120ns : ed Early Resource Recycling in Out-of-order Microprocessors Slide 23/41

OVERALL RESULTS 1.36 1.30 Unlimited Speedup 1.24 1.18 1.12 1.06 1.00 bzip2 crafty gcc gzip mcf parser perlbmk vortex vpr Average 1.06 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 24/41

OVERALL RESULTS 1.36 1.30 Unlimited Speedup 1.24 1.18 1.12 1.06 1.06 Speedup 1.00 1.60 1.54 1.48 1.42 1.36 1.30 1.24 1.18 1.12 1.06 1.00 bzip2 crafty gcc gzip mcf parser perlbmk vortex vpr Average applu apsi art equake mesa mgrid swim wupwise Average 1.26 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 24/41

OVERALL RESULTS 1.36 1.30 Unlimited Speedup 1.24 1.18 1.12 1.06 1.06 1.12 Speedup 1.00 1.60 1.54 1.48 1.42 1.36 1.30 1.24 1.18 1.12 1.06 1.00 bzip2 crafty gcc gzip mcf parser perlbmk vortex vpr Average 1.32 1.26 applu apsi art equake mesa mgrid swim wupwise Average : ed Early Resource Recycling in Out-of-order Microprocessors Slide 24/41

PNR ADVANCE PNR advance as a % of ROB use Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 25/41

PNR ADVANCE PNR advance as a % of ROB use Q Entry 77.4% SPECint Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 25/41

PNR ADVANCE PNR advance as a % of ROB use Q Entry 77.4% SPECint STQ Entry 19.7% Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 25/41

PNR ADVANCE PNR advance as a % of ROB use Q Entry 77.4% SPECint STQ Entry 19.7% Register 28.7% Tail (newest) Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 25/41

: ed Early Resource Recycling in Out-of-order Microprocessors Slide 25/41 PNR ADVANCE PNR advance as a % of ROB use 78.3% 63.4% 67.7% SPECfp Register 28.7% STQ Entry 19.7% (oldest) Head (newest) Tail Q Entry 77.4% SPECint

SUMMARY Decoupling resource recycling from retirement Split ROB into two logical sets Use of ROB+checkpointing simultaneously Early recycling mechanism: Q, STQ, Registers Integration with speculative multithreading (paper) Speedup over baseline 1.06 for SPECint2000 1.26 for SPECfp2000 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 26/41

QUESTIONS? José F. Martínez 1 Jose Renau 2 Michael C. Huang 3 Milos Prvulovic 2 Josep Torrellas 2 1 2 3 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 27/41

BACKUP SLIDES : ed Early Resource Recycling in Out-of-order Microprocessors Slide 28/41

RECYCLING PNRS Resource PNR Value Q entries (uniprocessor) U S Q entries (multiprocessor) oldest(u L, U S ) STQ entries oldest(u L, U S, U B ) Registers (uniprocessor) oldest(u S, U B ) Registers (multiprocessor) oldest(u L, U S, U B ) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 29/41

EARLY STQ ENTRY RECYCLING Constraint: Must not overwrite potentially useful data older(u l ): Unresolved may need old data older(u s ): No older is replayable older(u b ): Free of branch misprediction PNR STQ = oldest(u L, U S, U B ) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 30/41

CHECKPOINT FREQUENCY T o = c k + p e c e (1) p e = T c T e (2) c e = s T c 2 + (s 1) (3) T o T c = c k T c + ( s 1 ) Tc (4) 2 T e T c = c k T e s 1 2 (5) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 31/41

CHECKPOINT SELECTION 5 4 Te=200µs 400µs 600µs 800µs 1000µs To/Tc (%) 3 2 1 0 0 2 4 6 8 10 Tc (µs) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 32/41

SPECINT2000 PARAMETERS SPECint Apps Used Irreversible Set Collapse ROB (% of Used ROB) Step (%) Reg LQ SQ (Cycles) bzip2 29.9 24.3 55.8 19.5 292.3 crafty 28.8 33.4 97.6 28.6 41.9 gcc 19.1 19.0 82.3 17.8 66.9 gzip 28.5 65.5 81.7 8.5 47.1 mcf 30.1 14.6 37.7 13.8 695.6 parser 30.7 26.1 80.7 21.8 109.2 perlbmk 12.2 24.6 89.9 20.5 23.3 vortex 39.3 26.3 87.1 24.9 64.4 vpr 32.9 25.2 83.6 21.5 165.1 Average 27.9 28.7 77.4 19.7 167.3 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 33/41

SPECFP2000 PARAMETERS SPECfp Apps Used Irreversible Set Collapse ROB (% of Used ROB) Step (%) Reg LQ SQ (Cycles) applu 62.2 61.6 62.4 60.7 411.5 apsi 76.8 82.3 83.1 81.6 921.1 art 88.0 54.3 62.6 29.2 1247.3 equake 41.6 61.6 69.1 57.3 135.2 mesa 29.8 35.1 44.6 34.6 33.7 mgrid 65.1 91.5 93.5 91.3 335.9 swim 59.4 64.8 65.4 64.7 949.1 wupwise 71.9 90.3 71.2 87.9 190.7 Average 61.9 67.7 78.3 63.4 528.1 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 34/41

REGISTER ACCESS TIME 2500 2000 Access Time (ps) 1500 1000 500 0 100 200 300 400 500 600 700 800 900 1000 Number of Registers : ed Early Resource Recycling in Out-of-order Microprocessors Slide 35/41

OTHER CONSIDERATIONS (PAPER) Exception on instruction older than youngest PNR: Roll back to checkpoint Each PNR recycles at its own pace Not all PNRs critical, e.g. PNR Q Early register recycling: Must count in-flight consumers (Moudgill et al 1993) Optimal interval between checkpoints : ed Early Resource Recycling in Out-of-order Microprocessors Slide 36/41

CHERRY PARAMETERS checkpoint frequency: 5µs overhead: 60 cycles (18.75ns) Avg time between exceptions: 800µs (not modeled) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 37/41

INGORING U b Speedup 1.36 1.30 1.24 1.18 1.12 1.06 1.00 1.12 1.09 1.06 bzip2 crafty gcc gzip mcf parser perlbmk vortex vpr Average Unlimited Pass Branches : ed Early Resource Recycling in Out-of-order Microprocessors Slide 38/41

PERFECT ANCH PREDICTION Speedup 1.36 1.30 1.24 1.18 1.12 1.26 1.18 1.12 Unlimited Pass Branches 1.06 1.00 bzip2 crafty gcc gzip mcf parser perlbmk vortex vpr Average : ed Early Resource Recycling in Out-of-order Microprocessors Slide 39/41

INTERRUPT COLLAPSING Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 40/41

INTERRUPT COLLAPSING Interrupt Point of No Return Head (oldest) : ed Early Resource Recycling in Out-of-order Microprocessors Slide 40/41

INTERRUPT COLLAPSING Interrupt Point of No Return & Head : ed Early Resource Recycling in Out-of-order Microprocessors Slide 40/41

EQUIVALENT SYSTEM Resource Baseline SPECint2000 SPECfp2000 Q entries 32 80 128 STQ entries 32 64 112 Int Registers 192 240 240 FP Registers 128 128 192 : ed Early Resource Recycling in Out-of-order Microprocessors Slide 41/41