A Scheme of Predictor Based Stream Buffers. Bill Hodges, Guoqiang Pan, Lixin Su


Outline
- Background and motivation
- Project hypothesis
- Our scheme of a predictor-based stream buffer
  - Predictors
  - Predictor table
  - Stream buffer
- Results and analysis
- Research topics left

Contributors to Microprocessor Performance Degradation
- Branch misprediction
- Various hazards
- TLB misses
- Cache misses
  - Instruction cache misses
  - Data cache misses (miss rates vary from a few percent to tens of percent)

Techniques to Reduce Data Cache Misses in Superscalars
- Redesign the RF-cache-DRAM memory hierarchy
- Add L2 and L3 caches; move L2 onto the chip
- Victim cache, pseudo cache
- Stream buffer
- Aggressive instruction scheduling

How a Standard Stream Buffer Works (figure from N. Jouppi)

Stream Buffer Research Activities
- Where to start prefetching?
  - Sequential prefetching
  - Prefetching with a stride
  - Predictor-based prefetching
- How to avoid frequent flushing of the stream buffer?
- How to prevent pollution of the L1 data cache?
- Decoupled or coupled predictors?

Project Hypothesis
- Memory accesses have patterns, such as strides.
  - Localized data accesses touch frequently used data with smaller strides.
  - Non-localized data accesses touch infrequently used data with larger strides.
- Other access patterns may remain to be identified.

Our Scheme of a Predictor-Based Stream Buffer: Highlights
- Predictors are decoupled from the stream buffer; the stream buffer itself is a modified standard stream buffer.
- A group of predictors, each responsible for predictions falling within one superline (a consecutive range of addresses spanning several blocks); see the sketch after this list.
- Active predictors are implemented in a table similar to a TLB.
- The predictor makes the prefetching decision and chooses either the L1 data cache or the stream buffer as the prefetch target.
- Data stored in the stream buffer is never promoted to the L1 data cache.
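To make the superline granularity concrete, here is a minimal C++ sketch of the address-to-superline mapping. The block size, the number of blocks per superline, and the helper name are illustrative assumptions; the slides define a superline only as a consecutive range of addresses spanning several blocks.

```cpp
#include <cstdint>

// Hypothetical geometry: 32-byte blocks, 8 blocks per superline,
// so each predictor covers one 256-byte superline.
constexpr unsigned kBlockBits          = 5;  // log2(block size in bytes)
constexpr unsigned kBlocksPerSuperline = 3;  // log2(blocks per superline)

// All accesses that share a superline index are handled by one predictor.
inline uint64_t superline_index(uint64_t vaddr) {
    return vaddr >> (kBlockBits + kBlocksPerSuperline);
}
```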

Block Diagram

Predictors
- Decoupled predictors, driven by data addresses
- Predictor states:
  - Non-predictable states: UNINIT and RANDOM
  - Predictable states: STRIDE and SEMISTRIDE
  - Open to new states for new access patterns
- A predictor is looked up and updated on each memory access.
- Predicts fetches into the stream buffer for long strides and into the L1 data cache for short strides (see the sketch after the diagram below).

Predictor State Transition Diagram
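The transcription does not reproduce the diagram itself, so the transition rules in the sketch below are assumptions filled in around the state names from the previous slide: a predictor leaves UNINIT on its first access, adopts a tentative stride out of RANDOM, demotes STRIDE to SEMISTRIDE on a single broken stride, and issues a prefetch only when the stride is confirmed. The short-stride cutoff of 8 is likewise a guess, chosen to match the bucketing in the prefetch distribution table later in the deck.

```cpp
#include <cstdint>
#include <cstdlib>

// Predictor states named on the Predictors slide.
enum class State { UNINIT, RANDOM, STRIDE, SEMISTRIDE };

// Where a predicted prefetch is placed: short strides go to the L1 data
// cache, long strides to the stream buffer (per the scheme highlights).
enum class Target { NONE, L1_DCACHE, STREAM_BUFFER };

struct Predictor {
    State    state     = State::UNINIT;
    uint64_t last_addr = 0;
    int64_t  stride    = 0;

    // Hypothetical cutoff between "short" and "long" strides.
    static constexpr int64_t kShortStride = 8;

    // Observe one memory access within this predictor's superline and
    // return the prefetch target, if any; the prefetched address would
    // be addr + stride.
    Target access(uint64_t addr) {
        const int64_t delta  = static_cast<int64_t>(addr - last_addr);
        Target        target = Target::NONE;

        switch (state) {
        case State::UNINIT:                  // first access: no history yet
            state = State::RANDOM;
            break;
        case State::RANDOM:                  // adopt a tentative stride
            if (delta != 0) {
                stride = delta;
                state  = State::STRIDE;
            }
            break;
        case State::STRIDE:
        case State::SEMISTRIDE:
            if (delta == stride) {           // stride confirmed: prefetch
                state  = State::STRIDE;
                target = (std::llabs(stride) <= kShortStride)
                             ? Target::L1_DCACHE
                             : Target::STREAM_BUFFER;
            } else if (state == State::STRIDE) {
                state = State::SEMISTRIDE;   // one break: stride on probation
            } else {
                state = State::RANDOM;       // two breaks: give up the stride
            }
            break;
        }
        last_addr = addr;
        return target;
    }
};
```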

Predictor Table
- A virtual-address-indexed table
- A structure similar to the TLB; it might even be implemented inside the TLB
- Hosts a number of active predictors
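Building on the two sketches above, the predictor table can be modeled as a small virtual-address-indexed structure with TLB-style tag matching. The direct-mapped organization, the 64-entry capacity, and the evict-on-mismatch policy are assumptions; the slide specifies only that the table is virtual-address indexed and hosts the active predictors.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Direct-mapped table of active predictors, indexed by superline address.
class PredictorTable {
    static constexpr std::size_t kEntries = 64;  // assumed capacity

    struct Entry {
        bool      valid = false;
        uint64_t  tag   = 0;   // superline index this predictor tracks
        Predictor pred;        // Predictor from the earlier sketch
    };
    std::array<Entry, kEntries> table_;

public:
    // "Look up and update a predictor on each memory access": find (or
    // allocate) the predictor for this superline and let it observe the
    // access, returning its prefetch decision.
    Target access(uint64_t vaddr) {
        const uint64_t sl = superline_index(vaddr);
        Entry&         e  = table_[sl % kEntries];
        if (!e.valid || e.tag != sl) {       // miss: evict and start fresh
            e       = Entry{};
            e.valid = true;
            e.tag   = sl;
        }
        return e.pred.access(vaddr);
    }
};
```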

Our Stream Buffer
- A small fully associative buffer
- Replacement algorithm:
  - Track the latest accessed line.
  - Evict the line in the slot after the latest accessed one.
- The stream buffer is probed in parallel with (orthogonal to) the L1 data cache access.
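One way to read the replacement rule on this slide is as a circular buffer whose victim is the slot immediately after the most recently accessed line. The sketch below implements that reading; the 12-entry size matches the "SB (12)" configuration in the results table, and the probe-in-parallel comment reflects the "orthogonal" bullet. Both the exact eviction order and the interface are assumptions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Small fully associative stream buffer.
class StreamBuffer {
    static constexpr std::size_t kLines = 12;  // matches "SB (12)"

    struct Line {
        bool     valid = false;
        uint64_t addr  = 0;    // block-aligned address
    };
    std::array<Line, kLines> lines_;
    std::size_t last_hit_ = 0; // slot of the most recently accessed line

public:
    // Fully associative probe, performed in parallel with the L1 access.
    bool lookup(uint64_t block_addr) {
        for (std::size_t i = 0; i < kLines; ++i) {
            if (lines_[i].valid && lines_[i].addr == block_addr) {
                last_hit_ = i;
                return true;
            }
        }
        return false;
    }

    // Install a prefetched block, evicting the line in the slot that
    // follows the most recently accessed one.
    void insert(uint64_t block_addr) {
        const std::size_t victim = (last_hit_ + 1) % kLines;
        lines_[victim] = {true, block_addr};
    }
};
```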

Performance Comparison (IPC)

Benchmark | 4K L1  | 4K L1 with SB (12) | 8K L1
vpr       | 0.9326 | 0.9337             | 0.9341
mcf       | 1.0839 | 1.1163             | 1.1150
ammp      | 0.7477 | 0.7614             | 0.7495
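Read as speedups over the 4K baseline, these numbers put the 12-entry stream buffer on par with doubling the cache: for mcf, IPC rises from 1.0839 to 1.1163 (about 3.0%), slightly ahead of the 8K cache's 1.1150 (about 2.9%); for ammp, 0.7614 with the stream buffer beats the 8K cache's 0.7495; and vpr is essentially flat across all three configurations.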

Prefetch Distribution

Benchmark | stride ≤ 8: PF Req. / PF Done | 8 < stride < 32: PF Req. / PF Done | stride ≥ 32: PF Req. / PF Done
vpr       | 4,196,590 / 49,124            | 1,956,374 / 142,838                | 4,920,951 / 255,372
mcf       | 16,460,868 / 7,530,754        | 9,239,329 / 41,531                 | 2,504,918 / 1,756,205
ammp      | 9,248,365 / 28,294            | 438,057 / 2,718                    | 3,339,133 / 2,589,741
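The completion ratios implied by this table vary widely: mcf completes roughly 46% of its short-stride prefetch requests (7,530,754 of 16,460,868) and about 70% of its long-stride ones (1,756,205 of 2,504,918), while vpr completes only about 1% of its short-stride requests (49,124 of 4,196,590). This is consistent with mcf showing the largest IPC gain in the previous table.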

Cache and Stream Buffer Performance

Benchmark | Decrease in L1 Misses | Stream Buffer Hits | SB Prefetches | L1 Prefetches
vpr       | 39,976                | 190,001            | 398,210       | 49,124
mcf       | 546,259               | 538,237            | 3,234,279     | 7,530,754
ammp      | 16,584                | 461,067            | 2,615,366     | 28,294
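Dividing stream buffer hits by stream buffer prefetches gives a rough useful-prefetch ratio: about 48% for vpr (190,001 of 398,210) but only about 17% for mcf (538,237 of 3,234,279) and 18% for ammp (461,067 of 2,615,366), which suggests that much of mcf's gain comes from the short-stride prefetches placed directly into L1.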

Data Analysis
- Stride distributions exist in applications but show different characteristics from one application to another.
- The performance gain is modest, but comparable to that of doubling the L1 cache size.
- The performance increase correlates with prefetch prediction efficiency and prefetch efficiency.

Research Topics Left
- Implement better predictors, such as Markov predictors.
- Implement PC-coupled predictors and compare them against the current performance.
- Redesign the structure of the stream buffer and measure the resulting performance changes.
- Search for more data access patterns and add new predictor states.

Conclusions
- There exist localized and non-localized data accesses; localized data tend to have smaller strides.
- The distribution of stride sizes depends on the application.
- A predictor-based stream buffer can increase performance.