ECE/CS 752 Final Project: The Best-Offset & Signature Path Prefetcher Implementation. Qisi Wang, Hui-Shun Hung, Chien-Fu Chen


Outline
- Data Prefetching
- Existing Data Prefetchers
  - Stride Data Prefetcher
  - Offset Prefetcher (Best-Offset Prefetcher)
  - Look-Ahead Prefetcher (Signature Path Prefetcher)
- Experiment Results
  - Tool Background
  - Simulation Results
- Conclusion

Data Prefetching (Background)
Prefetching fetches data before it is needed:
- Reduces compulsory misses
- Reduces memory access latency, if prefetching accuracy is high and prefetches are issued early enough
Goal: predict which addresses will be needed in the future.

Next-N-Lines Prefetching: always prefetch the next N cache lines after a demand access or a demand miss.
Pros: easy to implement; well suited to sequential access.
Cons: wastes bandwidth on unwanted data when the access pattern is irregular.
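The next-N-lines policy above can be sketched in a few lines. This is an illustrative sketch, not the project's gem5 code; the 64-byte line size is an assumption.

```python
CACHE_LINE = 64  # bytes; assumed line size for illustration

def next_n_lines(demand_addr, n=4):
    """Return the addresses of the next n cache lines after a demand access."""
    line = demand_addr // CACHE_LINE          # align to the line boundary
    return [(line + i) * CACHE_LINE for i in range(1, n + 1)]
```

For example, a demand access anywhere in line 0x1000 triggers prefetches of lines 0x1040, 0x1080, and so on, regardless of whether the program will ever touch them; that is exactly the bandwidth-waste risk noted above.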

Data Prefetching (Background)
Offset Prefetching: prefetch the address at an offset X from the demanded address. If X = 1, this is next-line prefetching. Demanded address [A] -> prefetcher with offset X -> prefetch address [A + X].

Stride Prefetcher
A kind of offset prefetcher with a fixed distance. There are two kinds of stride prefetchers:
- Program counter (PC) based: records the distance between memory accesses made by a load instruction; the next time the same load instruction is fetched, prefetches last address + distance.
- Cache block address based: prefetches A + X, A + 2X, A + 3X, ... A stream buffer is a special case of this type: it avoids cache pollution; on a load miss, the stream buffer is checked and a hit is popped into the cache; if the stream buffer also misses, a new stream buffer is allocated.
Cons: the distance (stride) is fixed. Several variable-offset schemes have been proposed:
- Best-Offset (BO) Prefetcher
- Signature Path Prefetcher (SPP)
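A minimal sketch of the PC-based variant described above: each load PC remembers its last address and last stride, and a prefetch is issued only once the same stride repeats. The two-match confirmation is a common design choice, not something the slides specify.

```python
class PCStridePrefetcher:
    """PC-based stride prefetcher sketch (hypothetical table structure)."""

    def __init__(self):
        self.table = {}  # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        """Record a load access; return a prefetch address or None."""
        prefetch = None
        if pc in self.table:
            last_addr, last_stride = self.table[pc]
            stride = addr - last_addr
            if stride != 0 and stride == last_stride:
                prefetch = addr + stride   # stride confirmed twice: prefetch ahead
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, None)  # first sighting of this load PC
        return prefetch
```

A load at PC 0x400 walking an array with a 64-byte stride issues no prefetch on its first two accesses (training) and then prefetches one stride ahead on every later access.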

Best-Offset Prefetcher (Idea)
Varies the offset through a learning procedure, finding the best offset value for each application. Several candidate offsets are tested. An RR (recent requests) table records completed prefetch requests: when a prefetch of Y completes and the current offset is O, Y - O is saved into the RR table.

Best-Offset Prefetcher (Learning)
In the learning phase, every offset in the candidate list is tested (one round), with each L2 access testing one offset (DPC version: 46 offsets; paper version: 52 offsets). If the tested address hits in the RR table, that offset's score is incremented. All scores are reset to 0 when a learning phase begins. When the learning phase finishes (e.g., 100 rounds) or some offset reaches SCORE_MAX (DPC version: 31), the phase ends: the offset with the highest score becomes the new best offset, and a new learning phase starts.
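The learning loop above can be sketched as follows. This is a simplified model, assuming a small candidate list and a plain set for the RR table (the real RR table is a small, lossy cache); SCORE_MAX = 31 and 100 rounds come from the slides.

```python
OFFSETS = [1, 2, 3, 4, 5, 6, 8]   # small illustrative subset; real BO tests 46/52 offsets
SCORE_MAX = 31                    # DPC-2 value from the slides
MAX_ROUNDS = 100

class BestOffsetLearner:
    """Sketch of the BO learning phase (RR table modeled as a set)."""

    def __init__(self):
        self.rr = set()                     # recent-requests table
        self.scores = [0] * len(OFFSETS)
        self.idx = 0                        # which candidate offset is tested next
        self.rounds = 0
        self.best = 1

    def prefetch_completed(self, line, offset):
        # A prefetch of Y with offset O completed -> record Y - O
        self.rr.add(line - offset)

    def l2_access(self, line):
        d = OFFSETS[self.idx]
        if line - d in self.rr:             # offset d would have been timely here
            self.scores[self.idx] += 1
            if self.scores[self.idx] >= SCORE_MAX:
                self._end_phase()
                return
        self.idx += 1
        if self.idx == len(OFFSETS):        # one full round of candidates tested
            self.idx = 0
            self.rounds += 1
            if self.rounds >= MAX_ROUNDS:
                self._end_phase()

    def _end_phase(self):
        # Highest-scoring offset wins; reset state for the next learning phase
        self.best = OFFSETS[self.scores.index(max(self.scores))]
        self.scores = [0] * len(OFFSETS)
        self.idx = 0
        self.rounds = 0
```

The key property is that scoring uses only addresses of *completed* prefetches, so an offset earns points only when it would have produced a timely prefetch.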

Best-Offset Prefetcher
A degree-1 prefetcher (prefetches only one address per trigger); prefetching two offsets results in many useless prefetches. The prefetcher is turned off when the best score is too low (BAD_SCORE is the threshold), but the learning procedure keeps running. The MSHR threshold varies depending on the BO score and the L3 access rate.

Signature Path Prefetcher
A path-confidence-based prefetcher that performs history-based lookahead prefetching. The SPP tables are trained by L2 accesses; prefetching depends on the signature and pattern stored in the SPP tables and on the overall path probability.

Signature Path Prefetcher: Table Updating
When the L2 accesses a page, the corresponding signature table entry is updated:
- The offset is updated
- The offset difference (delta) is used to generate the new signature
- The old signature is used to update the pattern table
The same access pattern produces the same signature, which reduces training time and the number of pattern table entries.

Signature Path Prefetcher: Prefetching
Look up the signature of the currently accessed page, then choose the delta with the highest probability P_i (= C_delta / C_sig) at the i-th prefetch depth. If the product of all P_i is larger than the threshold, prefetch current address + delta, then use the delta to update the signature and access the pattern table again. When the product falls below the threshold, the procedure ends.
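The lookahead walk above can be sketched as follows. The pattern-table contents, the threshold value, and the toy shift-and-xor signature hash are all illustrative assumptions; the real SPP hash and thresholds differ.

```python
THRESHOLD = 0.25  # illustrative confidence cutoff, not the paper's exact value

# Hypothetical pattern table: signature -> {delta: count}
pattern_table = {
    0x1A: {1: 8, 2: 2},
    0x2B: {1: 6, 3: 1},
}

def new_signature(sig, delta):
    # Toy signature update (shift-and-xor); the real SPP hash differs
    return ((sig << 3) ^ delta) & 0xFFF

def spp_lookahead(sig, offset):
    """Walk the signature path, prefetching while cumulative confidence holds."""
    prefetches, confidence = [], 1.0
    while sig in pattern_table:
        counts = pattern_table[sig]
        total = sum(counts.values())                      # C_sig
        delta, c = max(counts.items(), key=lambda kv: kv[1])
        confidence *= c / total                           # multiply in P_i = C_delta / C_sig
        if confidence < THRESHOLD:
            break                                         # path no longer trusted
        offset += delta
        prefetches.append(offset)                         # speculate one step deeper
        sig = new_signature(sig, delta)
    return prefetches
```

Because the per-step probabilities are multiplied, confidence decays with depth, so the walk naturally prefetches deep only along strongly repeating delta paths.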

Gem5 Simulation System (block diagram): CPU with L1D and L1I caches, an L2 cache with an attached prefetcher, and the memory interface.

Gem5 Implementation (block diagram): the prefetcher is attached to the L2 cache, between the L1 caches and the memory interface.

System Setting
CPU: TimingSimpleCPU

Parameter       L1 Caches (Data/Instruction)   L2 Cache
Size            16 KB                          128 KB
Associativity   2                              8
Tag Latency     2 cycles                       20 cycles
Data Latency    2 cycles                       20 cycles
MSHR Size       4 entries                      16 entries
Replacement     LRU                            LRU
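The L1D column of the table could be expressed as a gem5 Python config object along these lines. This is a sketch against gem5's classic-cache parameter names; `response_latency` and `tgts_per_mshr` are required by gem5 but not given on the slide, so their values here are assumptions.

```python
# Sketch only: requires a gem5 build to import m5.objects.
from m5.objects import Cache

class L1DCache(Cache):
    size = '16kB'
    assoc = 2
    tag_latency = 2
    data_latency = 2
    response_latency = 2   # not on the slide; assumed
    mshrs = 4
    tgts_per_mshr = 8      # not on the slide; assumed
```

The L2 definition is analogous (128 kB, 8-way, 20-cycle latencies, 16 MSHRs); LRU replacement is gem5's historical default for the classic caches.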

Gem5 Implementation (block diagram): the L2 cache connects to the prefetcher, write queue, MSHR, and priority queue, which feed the memory interface.

L2 Cache-Prefetcher Interface (diagram): the L2 cache notifies the prefetcher on access and fill, passing the hit/miss status, PC, address, set, way, whether the block was a prefetch, and the evicted address. The prefetcher computes prefetch addresses and inserts them through a priority queue into the MSHR and write queue toward the memory interface.

Benchmark Setting
Prefetcher configurations:
- Basic PF types: Baseline, Stride (PC & Addr)
- DPC-2 PF types: Best-Offset, SPP, AMPM
Benchmarks: SPEC 2006 - 450.soplex, 454.calculix, 456.hmmer, 462.libquantum, 998.specrand

Sim. Result Normalized Performance

Sim. Result L2C Overall Miss Rate

Sim. Result Miss Rate Improvement

Conclusion
Contribution: open-source GitHub repository @ hfsken/gem5-with-dpc-2-prefetcher
- Includes a DPC-2 wrapper for adding DPC prefetchers
- Integrated with the following DPC prefetchers: Best-Offset, AMPM, Stride, SPP
Summary: for a short running time,
- The Best-Offset prefetcher performs better on benchmarks with more regular access patterns and a higher overall miss rate
- The performance gain on random access patterns is negligible
Future Work
- Complete the documentation in the GitHub repo
- Analyze benchmark behavior in detail in the report

References
[1] Pierre Michaud, "Best-Offset Hardware Prefetching," IEEE HPCA, 2016.
[2] Pierre Michaud, "A Best-Offset Prefetcher," DPC-2, 2015.
[3] J. Kim, S. H. Pugsley, P. V. Gratz, A. L. N. Reddy, C. Wilkerson, and Z. Chishti, "Path Confidence Based Lookahead Prefetching," IEEE/ACM MICRO, 2016.
[4] Jinchun Kim, Paul V. Gratz, and A. L. Narasimha Reddy, "Lookahead Prefetching with Signature Path," DPC-2, 2015.
[5] Course slides of Prof. Onur Mutlu, CMU.
[6] Course slides of Prof. Mikko Lipasti, UW-Madison.