Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin

2 Motivation Processors spend a significant portion of execution time on the wrong path: 47% of all cycles are spent fetching instructions on the wrong path for the SPEC INT 2000 benchmarks. Many memory references are made on the wrong path: 6% of all data references and 50% of all instruction references are wrong-path references. We would like to understand the effects of wrong-path memory references on processor performance. The goal is to build hardware/software mechanisms that take advantage of this understanding.

3 Questions We Seek to Answer 1. How important is it to correctly model wrong-path memory references? How do memory latency and window size affect this? 2. What is the relative significance of the negative and positive effects of wrong-path memory references on performance? Negative: cache pollution and bandwidth/resource contention. Positive: prefetching. 3. What kinds of code structures lead to the positive effects of wrong-path memory references?

4 Experimental Methodology Execution-driven, cycle-accurate simulator with an accurate wrong-path model and an aggressive memory model. Baseline processor: models the Alpha ISA; 8-wide fetch, decode, issue, retire; 128-entry instruction window; hybrid conditional branch predictor (64K-entry PAs, 64K-entry gshare, 64K-entry selector); aggressive indirect branch predictor (64K-entry target cache); 20-cycle branch misprediction latency; 64 KB, 4-way, 2-cycle L1 data and instruction caches; maximum 128 outstanding L1 misses; 1 MB, 8-way, 10-cycle, unified L2 cache; minimum 500-cycle L2 miss latency. 12 SPEC INT 2000 benchmarks evaluated.

Error in IPC if Wrong-path References are not Modeled: The Effect of Memory Latency 5 Negative error means wrong-path references are beneficial for performance. Maximum error is 7% and average error is 3.5% for a 500-cycle memory latency. In general, error increases as memory latency increases.

Error in IPC if Wrong-path References are not Modeled: The Effect of Instruction Window Size 6 Negative error means wrong-path references are beneficial for performance. Maximum error is 10% and average error is 4% for a 512-entry window. In general, error increases as instruction window size increases.

7 Insights Wrong-path references are important to model (to avoid errors of up to 10%). Wrong-path references usually have a positive impact on IPC performance, but they have a negative impact on performance for a few benchmarks (notably gcc and vpr). We would like to understand why.

8 Negative Effects of Wrong-path References 1. Bandwidth and resource contention: our simulations show this effect is not significant. 2. Cache pollution: to evaluate this effect, we examine four idealized models. No I-cache pollution: wrong-path references cause no pollution in the instruction cache. No D-cache pollution: wrong-path references cause no pollution in the data cache. No L2 cache pollution: wrong-path references cause no pollution in the L2 cache. No pollution in any cache: wrong-path references cause no pollution in any of the caches. We compare the performance of these models to the baseline model, which correctly models wrong-path references.

IPC Improvement if Pollution Caused by Wrong-path References is Eliminated 9 L1 pollution does not significantly affect performance. L2 pollution is the most significant negative effect of wrong-path references. L2 pollution is the cause of performance loss in gcc and vpr.

10 Insights Why is bandwidth/resource contention not a problem? There is enough bandwidth and there are enough resources in the memory system to accommodate the wrong-path references. Why is L1 pollution not significant? L1 misses due to pollution are short-latency misses (10 cycles), and these latencies are tolerated by the instruction window. Why is L2 pollution the dominant negative effect? L2 misses incur a very high penalty (500 cycles), and this latency cannot be tolerated by the instruction window. Hence, an L2 miss very likely results in a full-window stall on the correct path.

Positive Effects of Wrong-path References: Prefetching 11 A cache miss caused by a wrong-path reference is useful if the data brought in by the miss is later accessed by the correct path before being evicted. Most wrong-path cache misses are useful: 76% of all wrong-path data cache misses, 69% of all wrong-path instruction cache misses, and 75% of all wrong-path L2 misses are useful. But why? We would like to understand the code structures that result in useful wrong-path data prefetches.

Code Structures that Cause Useful Wrong-path Data References 12 1. Prefetching for later loop iterations: wrong-path execution of a loop iteration can prefetch data for the correct-path execution of the same iteration. Most useful wrong-path prefetches in mcf and bzip2 are generated this way. 2. One loop prefetching for another: wrong-path execution of a loop can prefetch data for the correct-path execution of another loop, if they are both working on the same data structure. 3. Prefetching in control-flow hammocks: wrong-path execution of the taken path can prefetch data for the correct-path execution of the not-taken path.

Prefetching for Later Loop Iterations (an example from mcf) 13
1 : arc_t *arc; // array of arc_t structures
2 : // initialize arc (arc = ...)
3 :
4 : for ( ; arc < stop_arcs; arc += size) {
5 :     if (arc->ident > 0) { // frequently mispredicted branch
6 :         // function calls and
7 :         // operations on the structure pointed to by arc
8 :         // ...
9 :     }
10: }
The processor mispredicts the branch on line 5 and does not execute the body of the if statement on the wrong path. Instead, the next iteration is executed on the wrong path. This next iteration initiates a load request for arc->ident, which misses the data cache. When the mispredicted branch is resolved, the processor recovers and first executes the body of the if statement on the correct path. On the correct path, the processor then executes the next iteration and initiates a load request for arc->ident, which has already been prefetched.

14 Conclusions Modeling wrong-path memory references is important: not modeling them causes errors of up to 10% in IPC estimates. The effect of wrong-path references on IPC increases with increasing memory latency and increasing window size. In general, wrong-path memory references are beneficial for performance due to prefetching effects. The dominant negative effect of wrong-path references is the pollution they cause in the L2 cache. We identified three code structures that are responsible for the prefetching benefits of wrong-path memory references.

Backup Slides

16 Related Work Butler [1993] was the first to observe that wrong-path references can be beneficial for performance. Moudgill et al. [1998] investigated the performance impact of wrong-path references with a 40-cycle memory latency; they found that the IPC error is negligible if wrong-path references are not modeled. Pierce and Mudge [1994] studied the effect of wrong-path references on cache hit rates; they found that wrong-path references can increase correct-path cache hit rates. Bahar and Albera [1998] proposed the use of a small fully-associative buffer at the L1 cache level to reduce the pollution caused by wrong-path references. Unfortunately, they assumed a priori that wrong-path references degrade performance. We build on previous work by focusing on understanding the relative significance of the different negative and positive effects of wrong-path references. We focus on IPC performance instead of cache hit rates. We also aim to understand why wrong-path references are useful by identifying the code structures that cause their positive effects.

17 Future Research Directions Techniques to reduce L2 cache pollution due to wrong-path references: caching wrong-path data in a separate buffer? Predicting the usefulness of wrong-path references? Techniques to make wrong-path execution more beneficial for performance: can the compiler structure the code such that wrong-path execution always (or usually) provides prefetching benefits?

How Much Time Does the Processor Spend on Wrong-path Instructions? 18 47% of all cycles are spent on the wrong path for the SPEC INT 2000 benchmarks. 53% of all fetched instructions and 17% of all executed instructions are on the wrong path.

IPC of the Baseline Processor 19

Percentage of Executed Wrong-path Instructions: The Effect of Memory Latency 20 Increasing the memory latency does not increase the number of executed wrong-path instructions, due to the limited (128-entry) instruction window size.

Percentage of Executed Wrong-path Instructions: The Effect of Instruction Window Size 21 A larger instruction window is able to execute more instructions (hence, more memory references) on the wrong path

22 Classification of Data Cache Misses Right bar: No wrong-path memory references. Left bar: Wrong-path memory references are correctly modeled. On average, 76% of the wrong-path data cache misses are fully or partially-used by the correct path.

23 Classification of Instruction Cache Misses Right bar: No wrong-path memory references. Left bar: Wrong-path memory references are correctly modeled. On average, 69% of the wrong-path instruction cache misses are fully or partially-used by the correct path.

24 Classification of L2 Cache Misses The number of misses suffered on the correct path (correct-path misses + partially-used wrong-path misses) increases for gcc and vpr if wrong-path memory references are correctly modeled. This is the cause of the IPC degradation for gcc and vpr. On average, 75% of the wrong-path L2 cache misses are fully or partially-used by the correct path.

IPC Improvement if Pollution Caused by Wrong-path References is Eliminated (16 KB L1 Caches) 25 L1 pollution does not significantly affect performance even with 16 KB L1 caches. L2 pollution is the most significant negative effect of wrong-path references.

Memory System 26