Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching

Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching

Jih-Kwon Peir, Shih-Chang Lai, Shih-Lien Lu, Jared Stark, Konrad Lai
peir@cise.ufl.edu
Computer & Information Science and Engineering, University of Florida

It's the Memory, Stupid

"3 out of 4 cycles are waiting for memory" in Pentium-Pro and Alpha 21164 running TPC-C. -- Richard Sites

- Cache performance: f(hit time, miss ratio, miss penalty)
- Performance impact due to cache latency:

    lw r1 <= 0(r2)     issue  register  addgen  mem1  mem2  hit/miss  commit
    add r3 <= r2, r1   stall  stall     stall   issue  register  execute  commit

- Minimum 3-cycle hit latency (total load-to-use: 3 cycles)
- Speculative issue assumes a hit; dependents are squashed and re-issued on a miss
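The cache-performance function above is conventionally expanded as the average memory access time, a standard textbook identity (the slide names only the three arguments):

    $\mathrm{AMAT} = \mathrm{Hit\ time} + \mathrm{Miss\ ratio} \times \mathrm{Miss\ penalty}$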

Outline

- Introduction
- Motivation and Related Work
- Bloom Filtering Cache Misses
  - Partitioned-address BF
  - Partial-address BF
- Pipeline Microarchitecture with BF
- Performance Evaluation
- Summary

No Speculation vs. Perfect Scheduling

[Bar chart: IPC (0 to 3) of "no data speculation" vs. "perfect scheduler" for Bzip, Gap, Gcc, Gzip, Mcf, Parser, Perl, Twolf, Vortex, Vpr, and Average]

- No data speculation degrades IPC by 15-20% (SPECint2000)

Related Work

- Simple solution: assume loads always hit in L1
- Alpha 21264: 4-bit counter; +1 on a hit, -2 on a miss
  - Predict hit when counter >= 8; mini-recovery if it misses
  - Predict miss when counter < 8; delay dependents, which hurts if it hits
- Pentium 4: always predict hit, replay on a miss
- Aggressive way prediction; needs recovery
- Replay, way prediction, and load-dependent scheduling
- Recovery buffer: frees speculative instructions from the scheduling queue
- Hybrid 2-level hit/miss predictor (as in branch prediction)
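For concreteness, a minimal C sketch of the Alpha-style saturating-counter predictor described above. The counter width, update amounts, and threshold come from the slide; the initial value and the names are illustrative assumptions.

    /* Alpha 21264-style load hit/miss predictor (sketch):
     * 4-bit saturating counter, +1 on hit, -2 on miss, predict hit at >= 8. */
    #include <stdbool.h>

    static unsigned counter = 15;   /* assumed initial value: saturated at "hit" */

    bool predict_hit(void) {
        return counter >= 8;        /* >= 8: issue dependents speculatively */
    }

    void train(bool was_hit) {
        if (was_hit)
            counter = (counter < 15) ? counter + 1 : 15;   /* +1, saturate at 15 */
        else
            counter = (counter >= 2) ? counter - 2 : 0;    /* -2, floor at 0 */
    }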

Bloom Filter (BF) - Introduction

- A probabilistic algorithm that tests membership in a large set by hashing into a bit array
- A BF quickly filters out non-members without querying the large set
- Applications here:
  - Filter cache misses
  - Schedule load dependents accurately
  - Overlap and reduce the cache miss penalty
  - Filter other table accesses
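As background, a minimal C sketch of a classic Bloom filter of this kind, assuming two simple multiplicative hash functions and a 1024-bit array; the sizes and hashes are illustrative, not the paper's hardware design.

    #include <stdbool.h>
    #include <stdint.h>

    #define BF_BITS 1024u                      /* bit-array size (2^10) */

    static uint8_t bf[BF_BITS / 8];

    /* Two multiplicative hashes; the top 10 bits index the 1024-bit array. */
    static unsigned h1(uint64_t x) { return (unsigned)((x * 0x9E3779B97F4A7C15ULL) >> 54); }
    static unsigned h2(uint64_t x) { return (unsigned)((((x >> 10) ^ x) * 0xC2B2AE3D27D4EB4FULL) >> 54); }

    static void set_bit(unsigned i) { bf[i / 8] |= (uint8_t)(1u << (i % 8)); }
    static bool get_bit(unsigned i) { return (bf[i / 8] >> (i % 8)) & 1u; }

    void bf_insert(uint64_t key) {             /* record a member */
        set_bit(h1(key));
        set_bit(h2(key));
    }

    bool bf_maybe_member(uint64_t key) {       /* false => guaranteed non-member */
        return get_bit(h1(key)) && get_bit(h2(key));
    }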

Partitioned-Address BF

- The requested line address is split into sub-fields A1, A2, A3, A4; each sub-field indexes its own counter array (BF1, BF2, BF3, BF4)
- On a cache miss, increment the counters indexed by the incoming (requested) line address and decrement the counters indexed by the replaced line address (R1, R2, R3, R4)
- Lookup: if any counter is zero, the result is True -- a guaranteed cache miss; otherwise False -- likely a cache hit
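A small software model of the partitioned-address BF sketched above, assuming a 32-bit line address split into four 8-bit sub-fields; the paper's exact field and counter widths are not reproduced here.

    #include <stdbool.h>
    #include <stdint.h>

    #define PARTS   4
    #define ENTRIES 256                          /* 2^8 counters per partition */

    static uint16_t bf[PARTS][ENTRIES];

    static unsigned field(uint32_t line_addr, int p) {
        return (line_addr >> (8 * p)) & 0xFF;    /* p-th 8-bit sub-field */
    }

    void bf_fill(uint32_t line_addr) {           /* requested line enters the cache */
        for (int p = 0; p < PARTS; p++) bf[p][field(line_addr, p)]++;
    }

    void bf_evict(uint32_t line_addr) {          /* replaced line leaves the cache */
        for (int p = 0; p < PARTS; p++) bf[p][field(line_addr, p)]--;
    }

    bool bf_guaranteed_miss(uint32_t line_addr) {
        for (int p = 0; p < PARTS; p++)          /* any zero counter => not cached */
            if (bf[p][field(line_addr, p)] == 0) return true;
        return false;                            /* all non-zero => likely hit */
    }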

Partial-Address BF

- The requested line address is split into tag, index, and offset; a partial address (p bits) of the line address indexes a one-bit BF array
- Set the bit on a cache miss (when the line is brought into L1)
- On replacement, a collision detector checks the partial address (p bits) of the replaced cache line; reset the bit only when there is no collision
- Lookup (hit/miss detector): False => guaranteed miss; True => (may) hit
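A corresponding software model of the partial-address BF. The 8192-bit array (1 KB) matches the Partial-16x configuration reported later; the hardware collision detector is modeled here as a per-bucket count of cached lines, which is an assumption for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    #define P_BITS  13
    #define BUCKETS (1u << P_BITS)          /* 8192 one-bit entries = 1 KB */

    static bool     bits[BUCKETS];          /* 1 = a cached line may map here */
    static uint16_t cached[BUCKETS];        /* stand-in for the collision detector */

    static unsigned partial(uint32_t line_addr) { return line_addr & (BUCKETS - 1); }

    void pbf_fill(uint32_t line_addr) {     /* line enters the cache */
        unsigned i = partial(line_addr);
        cached[i]++;
        bits[i] = true;
    }

    void pbf_evict(uint32_t line_addr) {    /* line leaves the cache */
        unsigned i = partial(line_addr);
        cached[i]--;
        if (cached[i] == 0) bits[i] = false;   /* reset only when no collision */
    }

    bool pbf_guaranteed_miss(uint32_t line_addr) {
        return !bits[partial(line_addr)];   /* clear bit => definite L1 miss */
    }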

Virtual-Address BF

- To benefit from cache miss filtering, the filter must respond early (before dependent scheduling) and must be accurate
- Filtering cache misses with the virtual address is early enough, but must handle the address-synonym problem
- Special handling: collision detection for physically-tagged caches
  - Partial-address (virtual) BFs separate collision detection from the cache tag path
  - Compare the replaced line's physical address (PA) with all other cached PAs that have the same page offset
  - Reset the BF array bit only when no match is found
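A sketch of that synonym-safe reset rule, assuming a simple software model of the cached physical addresses; the array names, page size, and index function are illustrative.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_OFFSET_MASK 0xFFFu     /* assume 4 KB pages */
    #define VBF_BUCKETS      8192u

    static bool     vbf[VBF_BUCKETS];   /* virtual-indexed partial-address BF */
    static uint64_t cache_pa[512];      /* PAs of currently cached lines (model) */
    static size_t   cache_lines;        /* assumed to exclude the replaced line */

    static unsigned vbf_index(uint64_t va) { return (unsigned)(va & (VBF_BUCKETS - 1)); }

    void vbf_on_replace(uint64_t victim_va, uint64_t victim_pa) {
        /* A synonym shares the page offset but differs in the upper address
         * bits: if any remaining cached PA matches the victim's page offset,
         * keep the bit set so a cached line is never reported as a miss. */
        for (size_t i = 0; i < cache_lines; i++)
            if ((cache_pa[i] & PAGE_OFFSET_MASK) == (victim_pa & PAGE_OFFSET_MASK))
                return;
        vbf[vbf_index(victim_va)] = false;   /* no match found: safe to reset */
    }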

Pipeline Execution with BF

    lw r1 <= 0(r2)     issue  register  addgen  mem1  mem2  hit/miss  commit
                                        (BF filtering)
    add r3 <= r2, r1   stall  stall     stall   issue  register  execute  commit

- A virtual-address BF filters the cache miss 2 cycles earlier, but that is still one cycle too late for dependent scheduling
- Delay dependent scheduling by one cycle
- Always predict hit, with precise recovery for the single-cycle speculation

Cache Miss Filtered by BF

    Load:        SCH  REG  AGN  CA1  CA2  H/M   (BF result, then L2 access)
    Dependent:   SCH  (filter miss, cancel dependents) ... SCH  REG  EXE  WRB  CMT
    Independent: SCH  REG  EXE  WRB  CMT   (no penalty)

- The cache miss is filtered by the BF within the one-cycle speculative window
- Precise recovery: reschedule only the dependents (they have to wait for the miss anyway)
- No penalty for independent instructions

Cache Miss Not Filtered

    Load:          SCH  REG  AGN  CA1  CA2  H/M   (BF, then L2 access)
                   |---- speculative window ----| cache miss, not filtered
    Dependent:     SCH  REG  EXE  flush  SCH  REG  EXE  WRB  CMT
    Dependent:     SCH            flush  SCH  REG  EXE  WRB  CMT
    Independent:   SCH  REG  EXE  flush  SCH  REG  EXE  WRB  CMT   (4-cycle penalty)
    Independent:   SCH            flush  SCH  REG  EXE  WRB  CMT   (2-cycle penalty)

Prefetching with BF

    Load:        SCH  REG  AGN  CA1  CA2  H/M   (BF, then L2 access)
    Dependent:   SCH  (filter miss, cancel dependents) ... SCH  REG  EXE  WRB  CMT
                 (filtered miss triggers the L2 access 2 cycles earlier)

- A BF-filtered miss triggers the L2 access 2 cycles earlier; the L1 miss is guaranteed
- Applicable to other caches, TLBs, branch prediction tables, etc.

Predictors and Extra Hardware

    Prediction Method      Additional Table (Bits)
    Partitioned BF - 3     15360
    Partitioned BF - 4     4480
    Partial BF - 1x        512
    Partial BF - 4x        2048
    Partial BF - 16x       8192
    Partial BF - 64x       32768
    Always-Hit             0
    Counter - 1 (Alpha)    4
    Counter - 128          512
    Counter - 512          2048
    Counter - 2048         8192
    Counter - 8192         32768

Cache Miss Filter Rate

[Bar chart: cache miss filtering rate (0-100%) of Partition-3, Partial-1x, Partial-4x, Partial-16x, and Partial-64x for Bzip, Gap, Gcc, Gzip, Mcf, Parser, Perl, Twolf, Vortex, Vpr, and Average]

- Partitioned BFs perform poorly; 97% of misses are filtered by Partial-16x

Accuracy of Various Predictors

[Stacked bar chart: percentage of correct hit/miss, incorrect-delay, and incorrect-cancel predictions (50-100%) for Counter-1, Counter-128, Counter-512, Counter-2048, Counter-8192, Always-hit, Partition-3, Partial-1x, Partial-4x, Partial-16x, and Partial-64x]

- The Bloom filter has NO incorrect-delays, i.e., a predicted miss is always a miss

IPC Comparison

[Bar chart: IPC (0 to 3) of No-speculation, Counter-1, Counter-2048, Always-hit, Partition-3, Partial-16x, Partial-16x-DP, Perfect-sch, and Perfect-sch-DP for Bzip, Gap, Gcc, Gzip, Mcf, Parser, Perl, Twolf, Vortex, Vpr, and Average]

Effect of Data Cache Size

[Chart: IPC improvement (8-20%) of Perfect-sch-DP, Partial-16x-DP, Partial-16x, and Always-hit for 8KB, 16KB, 32KB, and 64KB data caches]

Impact of RUU Size

[Bar chart: IPC improvement over Always-Hit (0-10%) of Partial-16x, Partial-16x-perfect, Partial-16x-DP, and Partial-16x-DP-perfect for RUU sizes of 32, 64, and 128]

Summary

- Data speculation schedules load dependents without knowing the load latency
- The Bloom filter identifies 97% of misses early, using a small 1 KB bit array
- 19% IPC improvement over no-speculation; 6% IPC improvement over the always-hit method
- Reaches 99.7% of the IPC of a perfect scheduler