Mesocode: Optimizations for Improving Fetch Bandwidth of Future Itanium Processors

Marsha Eng, Hong Wang, Perry Wang, Alex Ramirez, Jim Fung, and John Shen

Overview
  Applications of Mesocode for Itanium
  Improving fetch bandwidth
  Stream-based Mesocode
  Performance results
  Future work
  Related work
  Conclusion

Mesocode: Motivation
  Macro-code vs. micro-code
    Architectural vs. micro-architectural
    Compatibility vs. implementation-specific, legacy-free
  Mesocode: best of both
    Intermediate binary code representation
    Architecturally visible
    Implementation specific
    Legacy-free
    Like the Pentium processor performance monitor interface

Mesocode: Example
  [Figure labels: Original Code; CFG of blocks A, B, C, D with taken (T) and fall-through (FT) edges; Trigger; Layout at addresses i, i+1, i+2, ..., i+n+3 delimited by Boundary markers; Appendix]

Mesocode: Benefits
  Modest hardware change
    Decode logic to recognize trigger and boundaries
  Unobtrusive to original code
    Appendix to original code
    Mesocoded binary is guaranteed to be correct
    Backwards compatible
    Legacy free

Mesocode for Itanium ISA
  Target: future out-of-order (OOO) Itanium machine
  Address code-density and fetch bandwidth issues

Itanium Mesocode
  Itanium ISA puts instructions in bundles according to pre-defined templates
  Each bundle has 3 instruction slots (Slot 0, Slot 1, Slot 2) and a template
  Mesocode trigger: NOP with location encoded in offset
  Mesocode boundary: use 2 out of 8 undefined templates

Improving Fetch Bandwidth
  Two ways in which fetch bandwidth can degrade
    Wasteful instructions in the code stream
      Reduced code density
      Occupy pipeline resources
    Instruction cache misses
      Time spent waiting for cache fill
  Remedy
    Reduce wasteful instructions
    Reduce I-cache misses
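The bundle format on the Itanium slide above can be made concrete with a small sketch. It assumes the standard IA-64 encoding of a 128-bit bundle: a 5-bit template in the low bits followed by three 41-bit slots. The list of reserved template values is an assumption to be checked against the architecture manual, and the two values used as Mesocode boundaries (`0x06`, `0x07`) are purely hypothetical, since the slides do not say which of the 8 undefined templates are used.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1

# Assumption: template encodings left undefined by IA-64 (verify against the manual).
RESERVED_TEMPLATES = {0x06, 0x07, 0x14, 0x15, 0x1A, 0x1B, 0x1E, 0x1F}
# Hypothetical choice of the "2 out of 8" templates used as Mesocode boundaries.
BOUNDARY_TEMPLATES = {0x06, 0x07}


def encode_bundle(template, slots):
    """Pack a template and three 41-bit slots into one 128-bit integer."""
    if len(slots) != 3:
        raise ValueError("a bundle has exactly 3 slots")
    bundle = template & ((1 << TEMPLATE_BITS) - 1)
    for i, slot in enumerate(slots):
        bundle |= (slot & SLOT_MASK) << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def decode_bundle(bundle):
    """Split a 128-bit bundle into (template, [slot0, slot1, slot2])."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
             for i in range(3)]
    return template, slots


def is_boundary(bundle):
    """True if the decoder should treat this bundle as a Mesocode boundary."""
    template, _ = decode_bundle(bundle)
    return template in BOUNDARY_TEMPLATES
```

Because the boundary check only inspects the template field, it fits the "modest hardware change" claim: ordinary bundles decode exactly as before.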

Wasteful Instructions
  NOPs
  Predicated false instructions
    Effectively become NOPs in dynamic code stream
  [Figure: one bundle (Slot 0, Slot 1, Slot 2) with a predicated-false instruction and a NOP marked]

Wasteful Instructions
  Cache line misaligned instructions
    Branch targets must be bundle-aligned
    Taken branches might not be bundle-aligned
  [Figure: 2-bundle width cache line (two bundles of Slot 0, Slot 1, Slot 2) with a taken branch and a branch target marked; captions "3 wasteful instructions" and "4 wasteful instructions"]
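One plausible reading of the two counts in the misalignment figure is sketched below. It assumes a fetch delivers a whole 2-bundle cache line (6 slots), that every slot after a taken branch is discarded, and that every slot before a bundle-aligned branch target is discarded. The function names are hypothetical.

```python
SLOTS_PER_BUNDLE = 3
BUNDLES_PER_LINE = 2  # "2-bundle width cache line" from the slide


def wasted_after_taken_branch(bundle_idx, slot_idx):
    """Slots fetched but discarded after a taken branch at (bundle, slot)."""
    fetched = SLOTS_PER_BUNDLE * BUNDLES_PER_LINE
    used = bundle_idx * SLOTS_PER_BUNDLE + slot_idx + 1  # up to and including the branch
    return fetched - used


def wasted_before_target(bundle_idx):
    """Slots fetched but skipped before a target that starts bundle `bundle_idx`."""
    return bundle_idx * SLOTS_PER_BUNDLE


# A taken branch in Slot 1 of the first bundle wastes the remaining 4 slots.
print(wasted_after_taken_branch(0, 1))
# A branch target at the start of the second bundle wastes the first 3 slots.
print(wasted_before_target(1))
```

Under these assumptions the two cases give 4 and 3 wasted slots, which matches the captions in the figure.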

Wasteful Instruction Profile
  [Figure: "Percentage Wasteful Instructions". Y axis: percent of total instructions. X axis: benchmark (total inst count): gzip (27b), gcc (3b), mcf (4.7b), crafty (2.5b), parser (7.45b), gap (20b), average (50b). Series: Not Wasteful, Misalignment due to Taken Branch, Misalignment due to Branch Target, Pred False, Nop]
  On average, nearly 3 of fetched instructions are wasteful

Stream-Based Mesocode
  Streams
    Instructions between two retired taken branches
    static streams cover 9 execution
  [Figure: per-benchmark chart for gzip, gcc, mcf, crafty, parser, gap]
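The stream definition above (the instructions between two retired taken branches) can be illustrated with a short sketch. The trace format, a sequence of `(pc, is_taken_branch)` pairs for retired instructions, and both function names are assumptions made for illustration; the slides do not describe the profiling tool.

```python
from collections import Counter


def split_into_streams(trace):
    """Cut a retired-instruction trace into streams.

    A stream starts at the target of a taken branch and ends at the next
    taken branch, inclusive. `trace` yields (pc, is_taken_branch) pairs.
    """
    streams, current = [], []
    for pc, taken in trace:
        current.append(pc)
        if taken:
            streams.append(current)
            current = []
    if current:  # trailing instructions with no closing taken branch
        streams.append(current)
    return streams


def coverage_of_top_streams(streams, k):
    """Fraction of dynamic instructions covered by the k hottest static streams.

    A static stream is identified by (start pc, length).
    """
    weight = Counter()
    for s in streams:
        weight[(s[0], len(s))] += len(s)
    total = sum(weight.values())
    top = sum(w for _, w in weight.most_common(k))
    return top / total if total else 0.0


trace = [(0, False), (1, True), (8, False), (9, False), (10, True),
         (0, False), (1, True)]
streams = split_into_streams(trace)
print(streams)                              # three streams, two of them identical
print(coverage_of_top_streams(streams, 1))  # share of instructions in the hottest stream
```

Identifying a static stream by its start address and length is one common choice; a tool could equally key on the start and end addresses.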

Streams vs. Basic Blocks
  Size
    Streams are bigger than basic blocks
  Location
    Streams reside in Mesocode section
    Basic blocks reside in original code
  Transitions and prediction
    Branch prediction: block-block, block-stream
    Stream prediction: stream-stream, stream-block
  [Figure: "Average Basic Block and Stream Sizes". Y axis: number of instructions. X axis: gzip, gcc, mcf, crafty, parser, gap, average. Series: inst per basic block, inst per stream]

Stream Prediction
  [Figure labels: Original Code, Traditional Branch Predictor, Stream Predictor]
  Enables more aggressive instruction prefetch
    Predict the next stream instead of basic block
    Fetch further into code stream: different from regular prefetching
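To illustrate what "predict the next stream instead of basic block" could look like, here is a deliberately simple last-value predictor keyed by the start address of the current stream. This is a hypothetical sketch, not the predictor evaluated in the slides, whose organization is not described.

```python
class LastStreamPredictor:
    """Remembers, for each stream start address, the stream that followed it last time."""

    def __init__(self):
        self.table = {}

    def predict(self, stream_start):
        """Return the predicted start of the next stream, or None if unseen."""
        return self.table.get(stream_start)

    def update(self, stream_start, next_start):
        """Record the stream that actually followed `stream_start`."""
        self.table[stream_start] = next_start


def prediction_accuracy(stream_starts):
    """Fraction of stream-to-stream transitions predicted correctly."""
    predictor = LastStreamPredictor()
    hits = 0
    transitions = list(zip(stream_starts, stream_starts[1:]))
    for cur, nxt in transitions:
        if predictor.predict(cur) == nxt:
            hits += 1
        predictor.update(cur, nxt)
    return hits / len(transitions) if transitions else 0.0


# Two streams alternating in a loop: the first two transitions are cold misses.
print(prediction_accuracy([0x100, 0x200, 0x100, 0x200, 0x100, 0x200]))
```

Because one prediction covers a whole stream rather than a single basic block, a correct prediction lets the front end fetch or prefetch further ahead per lookup.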

Stream Prefetching
  [Figure labels: Stream, stream-stream, prepare to branch, Prefetch next stream]

Experimental Setup
  Pipeline Structure   In-order: 2 stage; out-of-order: 6 stage
  Branch Predictor     2K entry GSHARE with 256 entry, 4-way BTB
                       2K entry GSHARE with 32 entry, 4-way BTB
  Instruction Queue    8 bundle (24-instruction) queue
  Execute Bandwidth    6 instructions; out-of-order: 8 instruction schedule window
  Cache Structure      L1 (separate I and D): 6K 4-way, 8 way banked, 2 cycle latency
                       L2 (shared): 256K 4-way, 8 way banked, 4 cycle latency
                       L3 (shared): 3072K 2-way, way banked, 30 cycle latency
                       Fill buffer (MSHR): 6 entries
                       All caches have 64 byte lines
  Memory Latency       230 cycle latency; TLB miss penalty 30 cycles
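For readers unfamiliar with the GSHARE predictor named in the setup, the following is a textbook-style sketch: a table of 2-bit saturating counters indexed by the branch address XORed with a global history register. The 2K-entry size comes from the setup table; the history length, the initial counter value, and dropping the low 4 address bits (16-byte bundles) are assumptions, not details given in the slides.

```python
class Gshare:
    """Gshare direction predictor with 2-bit saturating counters."""

    def __init__(self, entries=2048, history_bits=11):
        if entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.index_mask = entries - 1
        self.history_mask = (1 << history_bits) - 1
        self.counters = [1] * entries  # assumption: start weakly not-taken
        self.history = 0

    def _index(self, pc):
        # Assumption: bundles are 16 bytes, so the low 4 pc bits carry no information.
        return ((pc >> 4) ^ self.history) & self.index_mask

    def predict(self, pc):
        """Predict taken when the counter is in one of its two upper states."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Train the counter for this (pc, history) pair, then shift the history."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


predictor = Gshare()
for _ in range(20):          # an always-taken branch
    predictor.update(0x4000, True)
print(predictor.predict(0x4000))
```

The XOR spreads different history patterns for the same branch across the table, so a branch needs a few executions before its steady-state entry is trained.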

Objectives of Experiment
  1. Quantify benefit of code density reduction
       Cache alignment and NOP removal
  2. Quantify benefit of branch and stream prediction
  3. Quantify benefit of stream prefetching
       Stream prefetching by prepare to branch

Configurations
  Mesocode optimizations
    Cache alignment and NOP removal
  Microarchitectural variations
    Name                                  Stream Predictor  Branch Predictor  Stream Prefetch
    No Stream Prefetch                    Yes               No                No
    No Stream Prefetch Hybrid Predictor   Yes               Yes               No
    Stream Prefetch                       Yes               No                Yes
    Stream Prefetch Hybrid Predictor      Yes               Yes               Yes

In-Order Performance
  [Figure: "Performance Improvement of In-order Execution with Mesocode". Y axis: speedup. X axis: benchmark (crafty, gap, gcc, gzip, mcf, parser, average). Series: No Stream Prefetch, No Stream Prefetch Hybrid Predictor, Stream Prefetch, Stream Prefetch Hybrid Predictor]
  Fetch bandwidth is not a critical resource for in-order pipelines
  5% average speedup

Out-of-Order Performance
  [Figure: "Performance Improvement of Out-of-Order Execution with Mesocode". Y axis: speedup. X axis: benchmark (crafty, gap, gcc, gzip, mcf, parser, average). Series: No Stream Prefetch, No Stream Prefetch Hybrid Predictor, Stream Prefetch, Stream Prefetch Hybrid Predictor]
  Fetch bandwidth is critical for out-of-order pipelines
  32% average speedup
  Mesocode enables prefetching to reduce I-cache misses

Future Work
  Other optimization via Mesocode
    Continuing with the fetch bandwidth theme
      Removal of predicate false instructions
    Alternative encodings
      Dependency encoding
    Dynamic construction
  Larger workloads
    Database applications
    Managed runtime environments (e.g. Java/.NET)

Related Work
  Trace cache: microarchitectural
    E. Rotenberg, S. Bennett, and J. E. Smith. Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching.
  rePLay: microarchitectural
    B. Fahs, S. Bose, M. Crum, B. Slechta, F. Spadini, T. Tung, S. J. Patel, and S. S. Lumetta. Performance Characterization of a Hardware Framework for Dynamic Optimization.
  Dynamo: software, architectural
    V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: A Transparent Dynamic Optimization System.
  Spike: software, architectural
    A. Ramirez, L. A. Barroso, K. Gharachorloo, R. Cohn, J. L. Larriba-Pey, P. G. Lowney, and M. Valero. Code Layout Optimizations for Transaction Processing Workloads.

Conclusion
  Mesocode as an intermediate encoding
    Architecturally visible
    Implementation specific
    Legacy-free
  Mesocode for Itanium: stream-based
    Target fetch bandwidth
      Code density
      Additional prefetch opportunity
    5% speedup on in-order
    32% speedup on out-of-order

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Hyesoon Kim Onur Mutlu Jared Stark David N. Armstrong Yale N. Patt High Performance Systems Group Department

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching

Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching Jih-Kwon Peir, Shih-Chang Lai, Shih-Lien Lu, Jared Stark, Konrad Lai peir@cise.ufl.edu Computer & Information Science and Engineering

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due Today Homework 4 Out today Due November 15

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

Speculative Multithreaded Processors

Speculative Multithreaded Processors Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads

More information

Instruction Level Parallelism (Branch Prediction)

Instruction Level Parallelism (Branch Prediction) Instruction Level Parallelism (Branch Prediction) Branch Types Type Direction at fetch time Number of possible next fetch addresses? When is next fetch address resolved? Conditional Unknown 2 Execution

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Improving Instruction Delivery with a Block-Aware ISA

Improving Instruction Delivery with a Block-Aware ISA 530 Improving Instruction Delivery with a Block-Aware ISA Ahmad Zmily, Earl Killian, and Christos Kozyrakis Electrical Engineering Department Stanford University {zmily,killian,kozyraki}@stanford.edu Abstract.

More information

Dynamic Code Value Specialization Using the Trace Cache Fill Unit

Dynamic Code Value Specialization Using the Trace Cache Fill Unit Dynamic Code Value Specialization Using the Trace Cache Fill Unit Weifeng Zhang Steve Checkoway Brad Calder Dean M. Tullsen Department of Computer Science and Engineering University of California, San

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Computer Architecture: Branch Prediction. Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Branch Prediction. Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Branch Prediction Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 11: Branch Prediction

More information

Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction

Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction ISA Support Needed By CPU Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction So far we have dealt with control hazards in instruction pipelines by: 1 2 3 4 Assuming that the branch

More information

ARMv8 Micro-architectural Design Space Exploration for High Performance Computing using Fractional Factorial

ARMv8 Micro-architectural Design Space Exploration for High Performance Computing using Fractional Factorial ARMv8 Micro-architectural Design Space Exploration for High Performance Computing using Fractional Factorial Roxana Rusitoru Systems Research Engineer, ARM 1 Motivation & background Goal: Why: Who: 2 HPC-oriented

More information

ECE404 Term Project Sentinel Thread

ECE404 Term Project Sentinel Thread ECE404 Term Project Sentinel Thread Alok Garg Department of Electrical and Computer Engineering, University of Rochester 1 Introduction Performance degrading events like branch mispredictions and cache

More information

A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors In Proceedings of the th International Symposium on High Performance Computer Architecture (HPCA), Madrid, February A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

More information

Multiple Stream Prediction

Multiple Stream Prediction Multiple Stream Prediction Oliverio J. Santana, Alex Ramirez,,andMateoValero, Departament d Arquitectura de Computadors Universitat Politècnica de Catalunya Barcelona, Spain Barcelona Supercomputing Center

More information

Wide Instruction Fetch

Wide Instruction Fetch Wide Instruction Fetch Fall 2007 Prof. Thomas Wenisch http://www.eecs.umich.edu/courses/eecs470 edu/courses/eecs470 block_ids Trace Table pre-collapse trace_id History Br. Hash hist. Rename Fill Table

More information

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs October 29, 2002 Microprocessor Research Forum Intel s Microarchitecture Research Labs! USA:

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

A superscalar machine is one in which multiple instruction streams allow completion of more than one instruction per cycle.

A superscalar machine is one in which multiple instruction streams allow completion of more than one instruction per cycle. CS 320 Ch. 16 SuperScalar Machines A superscalar machine is one in which multiple instruction streams allow completion of more than one instruction per cycle. A superpipelined machine is one in which a

More information

Simultaneously Improving Code Size, Performance, and Energy in Embedded Processors

Simultaneously Improving Code Size, Performance, and Energy in Embedded Processors Simultaneously Improving Code Size, Performance, and Energy in Embedded Processors Ahmad Zmily and Christos Kozyrakis Electrical Engineering Department, Stanford University zmily@stanford.edu, christos@ee.stanford.edu

More information

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N.

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N. Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors Moinuddin K. Qureshi Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical

More information

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas Diverge-Merge Processor (DMP): Dynamic Predicated Execution of Complex Control-Flow Graphs Based on Frequently Executed Paths yesoon Kim José. Joao Onur Mutlu Yale N. Patt igh Performance Systems Group

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors

Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors Portland State University ECE 587/687 The Microarchitecture of Superscalar Processors Copyright by Alaa Alameldeen and Haitham Akkary 2011 Program Representation An application is written as a program,

More information

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken Branch statistics Branches occur every 4-7 instructions on average in integer programs, commercial and desktop applications; somewhat less frequently in scientific ones Unconditional branches : 20% (of

More information

Towards a More Efficient Trace Cache

Towards a More Efficient Trace Cache Towards a More Efficient Trace Cache Rajnish Kumar, Amit Kumar Saha, Jerry T. Yen Department of Computer Science and Electrical Engineering George R. Brown School of Engineering, Rice University {rajnish,

More information

Control Hazards. Prediction

Control Hazards. Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Performance of tournament predictors In the last lecture, we saw the design of the tournament predictor used by the Alpha

Performance of tournament predictors In the last lecture, we saw the design of the tournament predictor used by the Alpha Performance of tournament predictors In the last lecture, we saw the design of the tournament predictor used by the Alpha 21264. The Alpha s predictor is very successful. On the SPECfp 95 benchmarks, there

More information

Appendix A.2 (pg. A-21 A-26), Section 4.2, Section 3.4. Performance of Branch Prediction Schemes

Appendix A.2 (pg. A-21 A-26), Section 4.2, Section 3.4. Performance of Branch Prediction Schemes Module: Branch Prediction Krishna V. Palem, Weng Fai Wong, and Sudhakar Yalamanchili, Georgia Institute of Technology (slides contributed by Prof. Weng Fai Wong were prepared while visiting, and employed

More information

Continuous Adaptive Object-Code Re-optimization Framework

Continuous Adaptive Object-Code Re-optimization Framework Continuous Adaptive Object-Code Re-optimization Framework Howard Chen, Jiwei Lu, Wei-Chung Hsu, and Pen-Chung Yew University of Minnesota, Department of Computer Science Minneapolis, MN 55414, USA {chenh,

More information

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

A Study of Control Independence in Superscalar Processors

A Study of Control Independence in Superscalar Processors A Study of Control Independence in Superscalar Processors Eric Rotenberg, Quinn Jacobson, Jim Smith University of Wisconsin - Madison ericro@cs.wisc.edu, {qjacobso, jes}@ece.wisc.edu Abstract An instruction

More information

Lec 11 How to improve cache performance

Lec 11 How to improve cache performance Lec 11 How to improve cache performance How to Improve Cache Performance? AMAT = HitTime + MissRate MissPenalty 1. Reduce the time to hit in the cache.--4 small and simple caches, avoiding address translation,

More information

A Scheme of Predictor Based Stream Buffers. Bill Hodges, Guoqiang Pan, Lixin Su

A Scheme of Predictor Based Stream Buffers. Bill Hodges, Guoqiang Pan, Lixin Su A Scheme of Predictor Based Stream Buffers Bill Hodges, Guoqiang Pan, Lixin Su Outline Background and motivation Project hypothesis Our scheme of predictor-based stream buffer Predictors Predictor table

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input

More information

An EPIC Processor with Pending Functional Units

An EPIC Processor with Pending Functional Units In Proceedings of the 4th International Symposium on High Performance Computing (ISHPC), May 2002, (c) Springer-Verlag. An EPIC Processor with Pending Functional Units Lori Carter, Weihaw Chuang and Brad

More information

On the Predictability of Program Behavior Using Different Input Data Sets

On the Predictability of Program Behavior Using Different Input Data Sets On the Predictability of Program Behavior Using Different Input Data Sets Wei Chung Hsu, Howard Chen, Pen Chung Yew Department of Computer Science University of Minnesota Abstract Smaller input data sets

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Hyperthreading Technology

Hyperthreading Technology Hyperthreading Technology Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville milenka@ece.uah.edu www.ece.uah.edu/~milenka/ Outline What is hyperthreading?

More information

Instruction Fetch Deferral using Static Slack

Instruction Fetch Deferral using Static Slack Instruction Fetch Deferral using Static Slack Gregory A. Muthler David Crowe Sanjay J. Patel Steven S. Lumetta Center for Reliable and High-Performance Computing Department of Electrical and Computer Engineering

More information

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 21: Superscalar Processing Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due November 10 Homework 4 Out today Due November 15

More information

Chip-Multithreading Systems Need A New Operating Systems Scheduler

Chip-Multithreading Systems Need A New Operating Systems Scheduler Chip-Multithreading Systems Need A New Operating Systems Scheduler Alexandra Fedorova Christopher Small Daniel Nussbaum Margo Seltzer Harvard University, Sun Microsystems Sun Microsystems Sun Microsystems

More information

INSTRUCTION LEVEL PARALLELISM

INSTRUCTION LEVEL PARALLELISM INSTRUCTION LEVEL PARALLELISM Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix H, John L. Hennessy and David A. Patterson,

More information

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations?

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations? Brown University School of Engineering ENGN 164 Design of Computing Systems Professor Sherief Reda Homework 07. 140 points. Due Date: Monday May 12th in B&H 349 1. [30 points] Consider the non-pipelined

More information

Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads

Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads Hideki Miwa, Yasuhiro Dougo, Victor M. Goulart Ferreira, Koji Inoue, and Kazuaki Murakami Dept. of Informatics, Kyushu

More information

ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013

ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 Professor: Sherief Reda School of Engineering, Brown University 1. [from Debois et al. 30 points] Consider the non-pipelined implementation of

More information

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University Announcements Homework 4 Out today Due November 15 Midterm II November 22 Project

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Microsoft ssri@microsoft.com Santhosh Srinath Onur Mutlu Hyesoon Kim Yale N. Patt Microsoft Research

More information

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era

More information

Static Branch Prediction

Static Branch Prediction Static Branch Prediction Branch prediction schemes can be classified into static and dynamic schemes. Static methods are usually carried out by the compiler. They are static because the prediction is already

More information

EECS 470 Final Exam Fall 2013

EECS 470 Final Exam Fall 2013 EECS 470 Final Exam Fall 2013 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: Page# Points 2 /21 3 /8 4 /12 5 /10 6

More information

Combining Local and Global History for High Performance Data Prefetching

Combining Local and Global History for High Performance Data Prefetching Combining Local and Global History for High Performance Data ing Martin Dimitrov Huiyang Zhou School of Electrical Engineering and Computer Science University of Central Florida {dimitrov,zhou}@eecs.ucf.edu

More information

Data-flow prescheduling for large instruction windows in out-of-order processors. Pierre Michaud, André Seznec IRISA / INRIA January 2001

Data-flow prescheduling for large instruction windows in out-of-order processors. Pierre Michaud, André Seznec IRISA / INRIA January 2001 Data-flow prescheduling for large instruction windows in out-of-order processors Pierre Michaud, André Seznec IRISA / INRIA January 2001 2 Introduction Context: dynamic instruction scheduling in out-oforder

More information

HW1 Solutions. Type Old Mix New Mix Cost CPI

HW1 Solutions. Type Old Mix New Mix Cost CPI HW1 Solutions Problem 1 TABLE 1 1. Given the parameters of Problem 6 (note that int =35% and shift=5% to fix typo in book problem), consider a strength-reducing optimization that converts multiplies by

More information

Speculative Multithreaded Processors

Speculative Multithreaded Processors Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads

More information

Inherently Lower Complexity Architectures using Dynamic Optimization. Michael Gschwind Erik Altman

Inherently Lower Complexity Architectures using Dynamic Optimization. Michael Gschwind Erik Altman Inherently Lower Complexity Architectures using Dynamic Optimization Michael Gschwind Erik Altman ÿþýüûúùúüø öõôóüòñõñ ðïîüíñóöñð What is the Problem? Out of order superscalars achieve high performance....butatthecostofhighhigh

More information

A Low Power Front-End for Embedded Processors Using a Block-Aware Instruction Set

A Low Power Front-End for Embedded Processors Using a Block-Aware Instruction Set A Low Power Front-End for Embedded Processors Using a Block-Aware Instruction Set Ahmad Zmily and Christos Kozyrakis Electrical Engineering Department, Stanford University Stanford, CA 94305, USA zmily@stanford.edu,

More information

1993. (BP-2) (BP-5, BP-10) (BP-6, BP-10) (BP-7, BP-10) YAGS (BP-10) EECC722

1993. (BP-2) (BP-5, BP-10) (BP-6, BP-10) (BP-7, BP-10) YAGS (BP-10) EECC722 Dynamic Branch Prediction Dynamic branch prediction schemes run-time behavior of branches to make predictions. Usually information about outcomes of previous occurrences of branches are used to predict

More information

José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2

José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2 CHERRY: CHECKPOINTED EARLY RESOURCE RECYCLING José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2 1 2 3 MOTIVATION Problem: Limited processor resources Goal: More

More information




