Mesocode: Optimizations for Improving Fetch Bandwidth of Future Itanium Processors


Marsha Eng, Hong Wang, Perry Wang, Alex Ramirez, Jim Fung, and John Shen

Overview
- Applications of mesocode for Itanium
- Improving fetch bandwidth
- Stream-based mesocode
- Performance results
- Future work
- Related work
- Conclusion

Mesocode: Motivation
- Macro-code vs. micro-code: architectural vs. micro-architectural; compatible vs. implementation-specific and legacy-free
- Mesocode: the best of both
  - An intermediate binary code representation
  - Architecturally visible
  - Implementation specific
  - Legacy-free
- Like the Pentium processor performance monitor interface

Mesocode: Example
[Figure: a control-flow graph over basic blocks A, B, C, D with taken and fall-through edges. The original code lays the blocks out in order A, B, C, D; the mesocoded layout keeps the original code intact and adds an appendix holding the reordered hot path A, D, C, entered through a trigger and delimited by boundary bundles at addresses i through i+n+3.]

Mesocode: Benefits
- Modest hardware change: decode logic to recognize triggers and boundaries
- Unobtrusive to the original code: mesocode is an appendix to the original code
- A mesocoded binary is guaranteed to be correct
- Backwards compatible, legacy-free

Mesocode for the Itanium ISA
- Target: a future out-of-order Itanium machine
- Addresses code-density and fetch-bandwidth issues

Mesocode for Itanium
- The Itanium ISA puts instructions in bundles according to pre-defined templates
- Each bundle has 3 instruction slots (Slot 0, Slot 1, Slot 2) and a template field
- Trigger: a NOP with the mesocode location encoded in its offset field
- Boundary: uses 2 of the 8 undefined bundle templates

Improving Fetch Bandwidth
Two ways in which fetch bandwidth can degrade:
- Wasteful instructions in the code stream: they reduce code density and occupy pipeline resources
- Instruction cache misses: time spent waiting for the cache fill
Remedy: reduce wasteful instructions; reduce I-cache misses
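The trigger/boundary recognition described above can be sketched in a few lines. The boundary template IDs (0x16, 0x17) and the trigger convention of a NOP carrying a nonzero immediate are hypothetical stand-ins for illustration; the slides do not give the actual encodings.

```python
# Sketch of trigger/boundary recognition in an Itanium-style front end.
# Template IDs 0x16 and 0x17 (standing in for 2 of the 8 undefined
# templates) and the trigger-NOP immediate convention are assumptions.

BOUNDARY_TEMPLATES = {0x16, 0x17}

class Bundle:
    def __init__(self, template, slots):
        assert len(slots) == 3          # Slot 0, Slot 1, Slot 2
        self.template = template
        self.slots = slots              # e.g. ("nop 0x40", "add r1=r2,r3", ...)

def is_boundary(bundle):
    """A mesocode boundary is a bundle using an undefined template."""
    return bundle.template in BOUNDARY_TEMPLATES

def trigger_target(bundle):
    """A trigger is a NOP whose immediate encodes the mesocode location.
    Returns the encoded offset, or None if no slot is a trigger."""
    for slot in bundle.slots:
        op, _, imm = slot.partition(" ")
        if op == "nop" and imm and int(imm, 16) != 0:
            return int(imm, 16)         # offset into the mesocode appendix
    return None

b = Bundle(0x00, ("nop 0x40", "add r1=r2,r3", "br.cond target"))
print(trigger_target(b))                            # 64
print(is_boundary(Bundle(0x17, ("x", "y", "z"))))   # True
```

Because triggers are ordinary NOPs and boundaries reuse templates that legacy decoders reject, an implementation without mesocode support simply ignores the appendix, which is what makes the scheme backwards compatible.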

Wasteful Instructions
- NOPs
- Predicated-false instructions: effectively become NOPs in the dynamic code stream
- Cache-line-misaligned instructions: branch targets must be bundle-aligned, but taken branches might not be bundle-aligned
[Figure: two cases on a 2-bundle-wide cache line — a mid-line taken branch leaves 3 wasteful instruction slots after it, and a mid-line branch target leaves 4 wasteful instruction slots before it.]
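A rough accounting of these wasteful-instruction categories can be sketched over a fetch trace. The trace format here is invented for illustration; the geometry (3-slot bundles, 2-bundle cache lines, so 6 slots per line) follows the slide.

```python
# Count wasteful fetch slots: NOPs, predicated-false instructions, and
# slots lost to cache-line misalignment around taken branches and their
# targets. Each trace entry records the instruction's slot index within
# a 2-bundle (6-slot) cache line plus a few flags.

SLOTS_PER_LINE = 6  # 2 bundles x 3 slots

def count_wasteful(trace):
    counts = {"nop": 0, "pred_false": 0,
              "taken_branch_misalign": 0, "target_misalign": 0}
    for inst in trace:
        if inst["is_nop"]:
            counts["nop"] += 1
        elif inst["pred_false"]:
            counts["pred_false"] += 1
        if inst["taken_branch"]:
            # slots after a taken branch in the line are fetched but unused
            counts["taken_branch_misalign"] += SLOTS_PER_LINE - 1 - inst["slot"]
        if inst["branch_target"]:
            # slots before a branch target in the line are fetched but unused
            counts["target_misalign"] += inst["slot"]
    return counts

trace = [
    {"slot": 2, "is_nop": False, "pred_false": False,
     "taken_branch": True,  "branch_target": False},   # wastes slots 3..5
    {"slot": 4, "is_nop": False, "pred_false": False,
     "taken_branch": False, "branch_target": True},    # wastes slots 0..3
    {"slot": 5, "is_nop": True,  "pred_false": False,
     "taken_branch": False, "branch_target": False},
]
print(count_wasteful(trace))
# {'nop': 1, 'pred_false': 0, 'taken_branch_misalign': 3, 'target_misalign': 4}
```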

Wasteful Instruction Profile
[Figure: breakdown of fetched instructions per benchmark (gzip, gcc, mcf, crafty, parser, gap, and the average) into not wasteful, misalignment due to taken branch, misalignment due to branch target, predicated false, and NOP.]
- On average, nearly 1/3 of fetched instructions are wasteful

Stream-Based Mesocode
- Streams: the instructions between two retired taken branches
- A small number of static streams covers 90% of execution
[Figure: cumulative coverage of dynamic execution by static streams for 164.gzip, 176.gcc, 181.mcf, 186.crafty, 197.parser, and 254.gap.]
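The stream definition (all instructions between two retired taken branches) can be made concrete with a small sketch over a retired-instruction trace; the trace itself is made up for illustration.

```python
# Split a retired-instruction trace into streams: a new stream begins
# after every retired taken branch. Each trace entry is
# (pc, is_taken_branch).

def split_streams(trace):
    streams, current = [], []
    for pc, taken in trace:
        current.append(pc)
        if taken:                 # a retired taken branch ends the stream
            streams.append(current)
            current = []
    if current:                   # trailing, still-open stream
        streams.append(current)
    return streams

trace = [(0x100, False), (0x104, False), (0x108, True),   # stream 1
         (0x200, False), (0x204, True),                   # stream 2
         (0x300, False), (0x304, False), (0x308, False)]  # stream 3
streams = split_streams(trace)
print([len(s) for s in streams])              # [3, 2, 3]
print(sum(map(len, streams)) / len(streams))  # average stream length
```

Measuring stream lengths this way is what supports the next slide's claim that streams are bigger than basic blocks: a stream spans every not-taken branch along the way, so it concatenates several basic blocks.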

Streams vs. Basic Blocks
- Size: streams are bigger than basic blocks
- Location: streams reside in the mesocode section; basic blocks reside in the original code
- Transitions and prediction: branch prediction covers block-to-block and block-to-stream transitions; stream prediction covers stream-to-stream and stream-to-block transitions
[Figure: average basic block and stream sizes (instructions per basic block vs. instructions per stream) for gzip, gcc, mcf, crafty, parser, gap, and the average.]

Stream Prediction
- The original code uses a traditional branch predictor; mesocode streams use a stream predictor
- Enables more aggressive instruction prefetch: predict the next stream instead of the next basic block
- Fetches further ahead into the code stream: different from regular prefetching

Stream Prefetching
- On a stream-to-stream transition, a prepare-to-branch hint prefetches the next stream

Experimental Setup
- Pipeline structure: in-order: 2 stages; out-of-order: 6 stages, with an 8-instruction schedule window
- Branch predictor: 2K-entry GSHARE with a 256-entry, 4-way BTB
- Stream predictor: 2K-entry GSHARE with a 32-entry, 4-way BTB
- Instruction queue: 8 bundles (24 instructions)
- Execute bandwidth: 6 instructions
- Cache structure: L1 (separate I and D): 16K, 4-way, 8-way banked, 2-cycle latency; L2 (shared): 256K, 4-way, 8-way banked, 4-cycle latency; L3 (shared): 3072K, 2-way, 1-way banked, 30-cycle latency; fill buffer (MSHR): 6 entries; all caches have 64-byte lines
- Memory latency: 230 cycles; TLB miss penalty: 30 cycles
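The setup's predictors are GSHARE-based; a minimal GSHARE of the stated 2K-entry size can be sketched as below. The index hash, 11-bit global history, and 2-bit saturating counters are the standard gshare details, assumed here since the slide gives only the table size.

```python
# Minimal GSHARE: XOR of the global history with PC bits indexes a
# table of 2-bit saturating counters. 2K entries, per the setup table.

class Gshare:
    def __init__(self, entries=2048):
        self.entries = entries
        self.hist_bits = entries.bit_length() - 1   # 11 bits for 2K
        self.table = [1] * entries                  # init weakly not-taken
        self.history = 0

    def _index(self, pc):
        return ((pc >> 2) ^ self.history) % self.entries

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2     # counter >= 2: taken

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = (min(3, self.table[i] + 1) if taken
                         else max(0, self.table[i] - 1))
        mask = (1 << self.hist_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

# A branch with the repeating pattern taken-taken-taken-not-taken:
# after warm-up, each phase gets its own history and thus its own
# counter, so the second half of the run predicts perfectly.
bp = Gshare()
outcomes = ([True] * 3 + [False]) * 50
hits = 0
for i, t in enumerate(outcomes):
    if i >= 100:
        hits += bp.predict(0x400) == t
    bp.update(0x400, t)
print(hits)   # 100
```

The stream predictor in the paper is the same structure with a smaller BTB; the difference is that its entries name the next stream rather than the next basic block, which is what enables prefetching further ahead.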

Objectives of Experiment
1. Quantify the benefit of code-density reduction: cache alignment and NOP removal
2. Quantify the benefit of branch and stream prediction
3. Quantify the benefit of stream prefetching via prepare-to-branch

Configurations
- Mesocode optimizations in all configurations: cache alignment and NOP removal
- Microarchitectural variations:
[Table: configurations named by their combination of predictor and prefetch settings — No Stream Prefetch, No Stream Prefetch Hybrid Predictor, Stream Prefetch, and Stream Prefetch Hybrid Predictor — with columns for stream predictor, branch predictor, and stream prefetch on/off.]

In-Order Performance
[Figure: speedup of in-order execution with mesocode for crafty, gap, gcc, gzip, mcf, parser, and the average, under the four configurations (No Stream Prefetch, No Stream Prefetch Hybrid Predictor, Stream Prefetch, Stream Prefetch Hybrid Predictor).]
- Fetch bandwidth is not a critical resource for in-order pipelines
- 5% average speedup

Out-of-Order Performance
[Figure: speedup of out-of-order execution with mesocode, same benchmarks and configurations.]
- Fetch bandwidth is critical for out-of-order pipelines
- 32% average speedup
- Mesocode enables prefetching that reduces I-cache misses

Future Work
- Other optimizations via mesocode, continuing the fetch-bandwidth theme: removal of predicated-false instructions; alternative encodings; dependency encoding; dynamic mesocode construction
- Larger workloads: database applications; managed runtime environments (e.g., Java/.NET)

Related Work
- Trace cache (microarchitectural): E. Rotenberg, S. Bennett, and J. E. Smith. "Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching."
- rePLay (microarchitectural): B. Fahs, S. Bose, M. Crum, B. Slechta, F. Spadini, T. Tung, S. J. Patel, and S. S. Lumetta. "Performance Characterization of a Hardware Framework for Dynamic Optimization."
- Dynamo (software, architectural): V. Bala, E. Duesterwald, and S. Banerjia. "Dynamo: A Transparent Dynamic Optimization System."
- Spike (software, architectural): A. Ramirez, L. A. Barroso, K. Gharachorloo, R. Cohn, J. L. Larriba-Pey, P. G. Lowney, and M. Valero. "Code Layout Optimizations for Transaction Processing Workloads."

Conclusion
- Mesocode is an intermediate encoding: architecturally visible, implementation specific, legacy-free
- Mesocode for Itanium is stream-based and targets fetch bandwidth: better code density plus an additional prefetch opportunity
- 5% speedup in-order; 32% speedup out-of-order