Low-Complexity Reorder Buffer Architecture*
1 Low-Complexity Reorder Buffer Architecture*. Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose. Department of Computer Science, State University of New York, Binghamton, NY. th Annual ACM International Conference on Supercomputing (ICS 02), June 24th, 2002. *Supported in part by DARPA through the PAC-C program and NSF.
2 Outline: ROB complexities; motivation for the low-complexity ROB; low-complexity ROB design; results; concluding remarks.
3 Pentium III-like Superscalar Datapath. [Figure: datapath with Fetch (F1, F2), Decode/Dispatch (D1, D2), issue queue (IQ), load/store queue (LSQ), function units FU1 to FUm (EX), ROB, architectural register file (ARF), D-cache, the instruction dispatch path, and the result/status forwarding buses.]
4 ROB Port Requirements for a W-way CPU. Decode/Dispatch: W write ports to set up entries. Writeback: W write ports to write results. Dispatch/Issue: 2W read ports to read the source operands. Commit: W read ports for instruction commitment.
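The per-stage counts on this slide can be tallied with a few lines of Python. This is only an illustrative sketch: the function name `rob_ports` and the dictionary keys are made up for this example, and the only inputs taken from the slide are the W, W, 2W, and W port counts.

```python
def rob_ports(width: int) -> dict:
    """Tally ROB ports for a W-way machine using the per-stage counts on the slide."""
    setup_writes = width        # Decode/Dispatch: set up new entries
    result_writes = width       # Writeback: write results
    source_reads = 2 * width    # Dispatch/Issue: two source operands per instruction
    commit_reads = width        # Commit: read out results being retired
    total = setup_writes + result_writes + source_reads + commit_reads
    return {
        "write": setup_writes + result_writes,
        "read": source_reads + commit_reads,
        "source_read": source_reads,
        "total": total,
        # What would remain if the source-operand read ports were removed.
        "total_without_source_reads": total - source_reads,
    }


if __name__ == "__main__":
    for w in (2, 4, 8):
        print(w, rob_ports(w))
```

Under these counts the source-operand read ports are 2W out of 5W, that is, 40% of all ROB ports for any width.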
5 Where are the Source Values Coming From? [Figure: the datapath of slide 3 with the three possible sources of operand values marked 1, 2, and 3.]
6 Where are the Source Values Coming From? [Chart: percentage of source operand values obtained from forwarding, the ARF, and the ROB, per SPEC2K benchmark, with integer, floating-point, and overall averages; 96-entry ROB, 4-way processor.] On average: forwarding 62%, ARF 32%, ROB 6%.
7 How Efficiently are the Ports Used? Decode/Dispatch: W write ports to set up entries. Writeback: W write ports to write results. Dispatch/Issue: 2W read ports to read the source operands, which supply only 6% of the source values. Commit: W read ports for instruction commitment.
8 Approaches to Reducing ROB Complexity. Reduce the number of read ports for reading out the source operand values. More radical (and better): completely eliminate the read ports for reading source operand values!
9 Reducing the Number of Read Ports. [Chart: performance drop (%) for bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg., applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.] Average IPC drop: 3.5% with 1 read port, 1.0% with 2 read ports.
10 Problems with Retaining Fewer Source Read Ports on the ROB. Arbitration is needed for the small number of ports. Additional logic is needed to block the instructions that could not get a port. A switching network is needed to route the operands to the correct destinations. Multi-cycle access still remains in the critical path of the Dispatch/Issue logic.
11-13 Our Solution: Elimination of Read Ports. [Figure: the datapath with its numbered operand sources, repeated over three builds of the same slide.]
14 Comparison of ROB Bitcells (0.18µ, TSMC). [Figure: layout of a 32-ported SRAM bitcell next to the layout of a 16-ported SRAM bitcell.] Area reduction: 71%. Shorter bit and wordlines.
15 Our Solution: Elimination of Read Ports. [Figure: the datapath of slide 3.] Area reduction: 45%.
16 Eliminating/Reducing the Number of Read Ports: Effects on Power Dissipation. Power is reduced because of: shorter bitlines and wordlines; lower capacitive loading; fewer decoders; fewer drivers and sense amps.
17 Completely Eliminating the Source Read Ports on the ROB. The problem: the issue of instructions that require a value stored in the ROB will stall. Solution: forward the value to the waiting instruction at the time the value is committed (LATE FORWARDING).
18-19 Late Forwarding: Use the Normal Forwarding Buses! [Figure: the datapath of slide 3, with the result/status forwarding buses highlighted; shown over two builds of the same slide.]
20 Optimizing Late Forwarding. PROBLEM: if Late Forwarding is done for every result that is committed, additional forwarding buses are needed in order not to degrade performance. SOLUTION: Selective Late Forwarding (SLF). SLF requires an additional bit in the ROB. That bit is set by the dispatched instructions that require Late Forwarding. No additional forwarding buses are needed, since SLF traffic is very small.
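A minimal Python sketch of the SLF bit described on this slide follows. It is an illustration under assumptions, not the authors' implementation: the names `RobEntry`, `late_forward`, `dispatch_consumer`, and `commit` are hypothetical, and the sketch assumes that a consumer needs Late Forwarding exactly when the producer's result has already been written into the ROB at dispatch time (otherwise the consumer picks the value up from normal forwarding).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RobEntry:
    value: Optional[int] = None   # result, filled in at writeback
    late_forward: bool = False    # the additional SLF bit (hypothetical name)


def dispatch_consumer(rob: List[RobEntry], producer_slot: int) -> None:
    """Called when a dispatched instruction names producer_slot as a source.

    Assumption: if the result is already in the ROB, the consumer missed the
    normal forwarding and cannot read the ROB, so it requests Late Forwarding.
    """
    entry = rob[producer_slot]
    if entry.value is not None:
        entry.late_forward = True


def commit(rob: List[RobEntry], slot: int) -> Optional[Tuple[int, int]]:
    """Commit one entry; return (slot, value) to put on a forwarding bus, or None."""
    entry = rob[slot]
    if entry.late_forward:
        return (slot, entry.value)
    return None   # nobody asked: no bus traffic for this commit


if __name__ == "__main__":
    rob = [RobEntry(value=7), RobEntry(value=9), RobEntry()]
    dispatch_consumer(rob, 0)          # result already in ROB: sets the bit
    dispatch_consumer(rob, 2)          # result pending: normal forwarding will serve it
    print(commit(rob, 0), commit(rob, 1))
```

Only commits whose bit is set generate a broadcast, which is why this scheme can share the existing forwarding buses rather than adding new ones.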
21 Late Forwarding: Use the Normal Forwarding Buses! [Figure: the datapath of slide 3, with the result/status forwarding buses highlighted.] Only 3.5% of the traffic is from Selective Late Forwarding.
22 Performance Drop of Simplified ROB. [Chart: performance drop (%) for bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg., applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.; two off-scale bars are labeled 17% and 37%.] Average IPC drop: 9.6% with no ROB read ports and SLF, 3.5% with 1 read port, 1.0% with 2 read ports.
23 IPC Penalty: Source Value Not Accessible within the ROB. [Timeline: lifetime of a result value. Result generation and forwarding come first; the value then resides within the ROB until late forwarding/commitment, after which it resides within the ARF.]
24 Improving IPC with No Read Ports. Cache recently generated values in a set of RETENTION LATCHES (RL). Retention latches are SMALL and FAST. Only 8 to 16 latches are needed in the set. The entire set has 1 or 2 read ports.
25-26 Datapath with the Retention Latches. [Figure: the datapath of slide 3, first as before and then with a set of RETENTION LATCHES added.]
27 The Structure of the Retention Latch Set. [Figure: a set of 8 or 16 latches, each with a status field, a result value, and an L-ported CAM field (key = ROB_slot_id). Inputs: L ROB slot addresses (L = 1 or 2) and W write ports for writing up to W results in parallel. Outputs: L recently written results (L = 1 or 2 works great).]
28 Retention Latch Management Strategies. FIFO: 8-entry RL, 42% hit rate; 16-entry RL, 55% hit rate. LRU: 8-entry RL, 56% hit rate; 16-entry RL, 62% hit rate. Random replacement: worse performance than FIFO.
29 Hit Ratios to Retention Latches. [Chart: hit ratios (%) for bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg., applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.] Average hit ratio: 42% for 8 2-ported FIFO, 55% for 16 2-ported FIFO, 56% for 8 2-ported LRU, 62% for 16 2-ported LRU.
30 Accessing Retention Latch Entries. The ROB index is used as a unique key in the retention latches to search for result values. Keys must remain unique even in the presence of: reuse of a ROB slot (not a problem for FIFO; for LRU, simply flush the RL entry at commit time); branch mispredictions.
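The retention latch set behaves like a very small cache keyed by ROB slot. The Python sketch below models that behavior for the FIFO and LRU policies named in the slides. It is a software analogy only: the class name `RetentionLatches` and its methods are hypothetical, the hardware performs the lookup with a CAM rather than a dictionary, and dropping a stale entry on rewrite is this sketch's own safeguard rather than something stated in the slides.

```python
from collections import OrderedDict
from typing import Optional


class RetentionLatches:
    """Small cache of recently written results, keyed by ROB slot id."""

    def __init__(self, size: int = 8, policy: str = "FIFO") -> None:
        if policy not in ("FIFO", "LRU"):
            raise ValueError("policy must be 'FIFO' or 'LRU'")
        self.size = size
        self.policy = policy
        self.latches: "OrderedDict[int, int]" = OrderedDict()  # oldest entry first

    def write(self, rob_slot: int, value: int) -> None:
        """Record a result at writeback, evicting the entry at the front if full."""
        self.latches.pop(rob_slot, None)        # safeguard: drop any stale entry
        if len(self.latches) >= self.size:
            self.latches.popitem(last=False)    # oldest (FIFO) or least recently used (LRU)
        self.latches[rob_slot] = value

    def read(self, rob_slot: int) -> Optional[int]:
        """Associative lookup by ROB slot; None models a miss."""
        if rob_slot not in self.latches:
            return None
        if self.policy == "LRU":
            self.latches.move_to_end(rob_slot)  # a hit refreshes recency under LRU
        return self.latches[rob_slot]

    def on_commit(self, rob_slot: int) -> None:
        """For LRU, flush the entry at commit so a reused ROB slot cannot match it."""
        if self.policy == "LRU":
            self.latches.pop(rob_slot, None)


if __name__ == "__main__":
    rl = RetentionLatches(size=2, policy="LRU")
    rl.write(1, 10)
    rl.write(2, 20)
    rl.read(1)          # slot 1 becomes most recently used
    rl.write(3, 30)     # evicts slot 2
    print(rl.read(1), rl.read(2), rl.read(3))
```

The commit-time flush is applied only for LRU because, as the slide notes, slot reuse is not a problem for FIFO.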
31 Handling Branch Mispredictions. Selective RL flushing: retention latch entries that are in the mispredicted path are flushed; uses branch tags; complicated implementation. Complete RL flushing: all retention latch entries are flushed; very simple implementation; the performance drop is only 1.5% compared to selective flushing.
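The two flushing strategies can be contrasted in a short Python sketch. The representation is an assumption made for illustration: each latch entry is stored as `(value, branch_tag)`, where `branch_tag` is a hypothetical field identifying the unresolved branch under which the producing instruction was fetched (`None` if there is none). The slides say only that selective flushing "uses branch tags".

```python
from typing import Dict, Optional, Set, Tuple

Latch = Tuple[int, Optional[int]]   # (value, branch_tag); branch_tag is hypothetical


def selective_flush(latches: Dict[int, Latch], squashed_tags: Set[int]) -> Dict[int, Latch]:
    """Drop only the entries produced on the mispredicted path."""
    return {
        slot: (value, tag)
        for slot, (value, tag) in latches.items()
        if tag not in squashed_tags
    }


def complete_flush(latches: Dict[int, Latch]) -> Dict[int, Latch]:
    """Drop everything: no tag comparison is needed at all."""
    return {}


if __name__ == "__main__":
    latches = {3: (10, None), 4: (20, 1), 5: (30, 2)}
    print(selective_flush(latches, {2}))   # keeps slots 3 and 4
    print(complete_flush(latches))         # keeps nothing
```

Complete flushing discards correct-path entries as well, so it costs some later hits but needs no per-entry tag storage or comparison.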
32 Misprediction Handling: Performance. [Chart: IPC with selective flushing and with complete flushing for bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, applu, apsi, art, equake, mesa, mgrid, swim, wupwise, and the Int., FP, and overall averages.] Average IPC drop: 1.5%.
33 Experimental Setup: the AccuPower (DATE 02). [Flow diagram: compiled SPEC benchmarks and datapath specs feed a microarchitectural simulator (rooted in SimpleScalar), which produces performance stats along with transition counts and context information. VLSI layout data yield a SPICE deck, and SPICE produces measures of energy per transition. The energy/power estimator outputs power/energy stats.]
34 Configuration of the Simulated System. Machine width: 4-way. Issue queue: 32 entries. Reorder buffer: 96 entries. Load/store queue: 32 entries. Simulated the execution of SPEC2000 benchmarks.
35 Assumed Timings. Timing of the baseline model (stages D1, D2, D3): rename table lookup for the ROB index, followed by the source operand read from the ROB over two stages. Timing of the simplified ROB (stages D1, D2): rename table lookup for the ROB index, followed by an associative lookup of the operand from the retention latches using the ROB index as a key (smaller delay: few latches).
36 Experimental Results: Effect on Performance. [Chart: performance drop (%) for bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg., applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.] Average IPC drop: 0.1% for 8 2-ported FIFO, -1.6% for 8 2-ported LRU, -1.0% for 16 2-ported FIFO, -2.3% for 16 2-ported LRU (negative values are IPC gains).
37 Experimental Results: Effect on Performance. [Chart: performance drop (%) for bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg., applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.] Average IPC drop: 3.3% for 8 2-ported FIFO, 1.7% for 8 2-ported LRU, 2.3% for 16 2-ported FIFO, 1.0% for 16 2-ported LRU.
38 Experimental Results: Effect on Power. [Chart: power savings (%) for bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg., applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.] Average savings: 30% with no ROB ports, 23.4% for 8 2-ported FIFO, 22.2% for 8 2-ported LRU, 21% for 16 2-ported FIFO, 20.2% for 16 2-ported LRU.
39 Summary of Results. Significantly reduced ROB complexity and power dissipation: 45% area reduction; 20% to 30% power reduction across SPEC 2000 benchmarks. Actual IPC improvements: 1.6% to 2.3% gain across SPEC benchmarks. IPC gains come from the 1-cycle access to the RL (vs. the 2 cycles that would be needed for ROB access).
40 Related Work. Value-Aging Buffer (Hu & Martonosi, PACS 2000). Forwarding Buffer and Clustered Register Cache (Borch et al., HPCA 02). Multiple Register Banks (Cruz et al., ISCA 00 & Balasubramonian et al., MICRO 01). See the paper for discussions.
41 Conclusions. Typical source operand location statistics can be successfully exploited to reduce ROB complexity. Significant reduction in ROB area and power: no ROB ports are needed for reading source operands. IPC gains are possible because of the use of a small, low-ported retention latch set to supply cached operand values in a single cycle.
Dynamic Scheduling with Narrow Operand Values Erika Gunadi University of Wisconsin-Madison Department of Electrical and Computer Engineering 1415 Engineering Drive Madison, WI 53706 Submitted in partial
More informationFiltering of Unnecessary Branch Predictor Lookups for Low-power Processor Architecture *
JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 24, 1127-1142 (2008) Filtering of Unnecessary Branch Predictor Lookups for Low-power Processor Architecture * Department of Computer Science National Chiao
More informationHandout 2 ILP: Part B
Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP
More informationChapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed
More informationSuperscalar Processor Design
Superscalar Processor Design Superscalar Organization Virendra Singh Indian Institute of Science Bangalore virendra@computer.org Lecture 26 SE-273: Processor Design Super-scalar Organization Fetch Instruction
More informationSoftware-assisted Cache Mechanisms for Embedded Systems. Prabhat Jain
Software-assisted Cache Mechanisms for Embedded Systems by Prabhat Jain Bachelor of Engineering in Computer Engineering Devi Ahilya University, 1986 Master of Technology in Computer and Information Technology
More informationPower/Performance Advantages of Victim Buer in. High-Performance Processors. R. Iris Bahar y. y Brown University. Division of Engineering.
Power/Performance Advantages of Victim Buer in High-Performance Processors Gianluca Albera xy x Politecnico di Torino Dip. di Automatica e Informatica Torino, ITALY 10129 R. Iris Bahar y y Brown University
More informationPartitioning Multi-Threaded Processors with a Large Number of Threads Λ Ali El-Moursy?, Rajeev Garg?, David H. Albonesi y and Sandhya Dwarkadas?
Partitioning Multi-Threaded Processors with a Large Number of Threads Λ Ali El-Moursy?, Rajeev Garg?, David H. Albonesi y and Sandhya Dwarkadas?? Departments of Electrical and Computer Engineering and
More informationFast Branch Misprediction Recovery in Out-of-order Superscalar Processors
Fast Branch Misprediction Recovery in Out-of-order Superscalar Processors Peng Zhou Soner Önder Steve Carr Department of Computer Science Michigan Technological University Houghton, Michigan 4993-295 {pzhou,soner,carr}@mtuedu
More informationTODAY S superscalar processor microarchitectures place. Execution Cache-Based Microarchitecture for Power-Efficient Superscalar Processors
14 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 13, NO. 1, JANUARY 2005 Execution Cache-Based Microarchitecture for Power-Efficient Superscalar Processors Emil Talpes and Diana
More informationIntroduction. Introduction. Motivation. Main Contributions. Issue Logic - Motivation. Power- and Performance -Aware Architectures.
Introduction Power- and Performance -Aware Architectures PhD. candidate: Ramon Canal Corretger Advisors: Antonio onzález Colás (UPC) James E. Smith (U. Wisconsin-Madison) Departament d Arquitectura de
More informationVSV: L2-Miss-Driven Variable Supply-Voltage Scaling for Low Power
VSV: L2-Miss-Driven Variable Supply-Voltage Scaling for Low Power Hai Li, Chen-Yong Cher, T. N. Vijaykumar, and Kaushik Roy 1285 EE Building, ECE Department, Purdue University @ecn.purdue.edu
More informationCPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor
1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic
More informationA Complexity-Effective Out-of-Order Retirement Microarchitecture
1 A Complexity-Effective Out-of-Order Retirement Microarchitecture S. Petit, J. Sahuquillo, P. López, R. Ubal, and J. Duato Department of Computer Engineering (DISCA) Technical University of Valencia,
More informationMicroarchitectural Techniques to Reduce Interconnect Power in Clustered Processors
Microarchitectural Techniques to Reduce Interconnect Power in Clustered Processors Karthik Ramani, Naveen Muralimanohar, Rajeev Balasubramonian Department of Electrical and Computer Engineering School
More informationLimiting the Number of Dirty Cache Lines
Limiting the Number of Dirty Cache Lines Pepijn de Langen and Ben Juurlink Computer Engineering Laboratory Faculty of Electrical Engineering, Mathematics and Computer Science Delft University of Technology
More informationExploiting Streams in Instruction and Data Address Trace Compression
Exploiting Streams in Instruction and Data Address Trace Compression Aleksandar Milenkovi, Milena Milenkovi Electrical and Computer Engineering Dept., The University of Alabama in Huntsville Email: {milenka
More informationPortland State University ECE 587/687. Superscalar Issue Logic
Portland State University ECE 587/687 Superscalar Issue Logic Copyright by Alaa Alameldeen, Zeshan Chishti and Haitham Akkary 2017 Instruction Issue Logic (Sohi & Vajapeyam, 1987) After instructions are
More informationSecurity-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat
Security-Aware Processor Architecture Design CS 6501 Fall 2018 Ashish Venkat Agenda Common Processor Performance Metrics Identifying and Analyzing Bottlenecks Benchmarking and Workload Selection Performance
More informationCache Pipelining with Partial Operand Knowledge
Cache Pipelining with Partial Operand Knowledge Erika Gunadi and Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin - Madison {egunadi,mikko}@ece.wisc.edu Abstract
More informationFast branch misprediction recovery in out-oforder superscalar processors
See discussions, stats, and author profiles for this publication at: https://wwwresearchgatenet/publication/2223623 Fast branch misprediction recovery in out-oforder superscalar processors CONFERENCE PAPER
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationKilo-instruction Processors, Runahead and Prefetching
Kilo-instruction Processors, Runahead and Prefetching Tanausú Ramírez 1, Alex Pajuelo 1, Oliverio J. Santana 2 and Mateo Valero 1,3 1 Departamento de Arquitectura de Computadores UPC Barcelona 2 Departamento
More information