Dynamic Memory Dependence Predication

Similar documents
Hardware-Based Speculation

Architectures for Instruction-Level Parallelism

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Copyright 2012, Elsevier Inc. All rights reserved.

Multiple Instruction Issue and Hardware Based Speculation

The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Getting CPI under 1: Outline

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

CS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false.

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

In-order vs. Out-of-order Execution. In-order vs. Out-of-order Execution

A Mechanism for Verifying Data Speculation

Understanding The Effects of Wrong-path Memory References on Processor Performance

PowerPC 620 Case Study

The Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University

Use-Based Register Caching with Decoupled Indexing

CS152 Computer Architecture and Engineering. Complex Pipelines

Multiple Instruction Issue. Superscalars

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

Techniques for Efficient Processing in Runahead Execution Engines

5008: Computer Architecture

Pentium 4 Processor Block Diagram

EECS 470 Midterm Exam

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Motivation. Banked Register File for SMT Processors. Distributed Architecture. Centralized Architecture

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation

Execution-based Prediction Using Speculative Slices

Superscalar Organization

For this problem, consider the following architecture specifications: Functional Unit Type Cycles in EX Number of Functional Units

Last lecture. Some misc. stuff An older real processor Class review/overview.

6.004 Tutorial Problems L22 Branch Prediction

CS152 Computer Architecture and Engineering. Complex Pipelines, Out-of-Order Execution, and Speculation Problem Set #3 Due March 12

Portland State University ECE 587/687. Memory Ordering

Computer System Architecture Quiz #2 April 5th, 2019

EECS 470 Final Project Report

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects


Write only as much as necessary. Be brief!

CS433 Homework 2 (Chapter 3)

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )

José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2

CS252 Graduate Computer Architecture Midterm 1 Solutions

HARDWARE SPECULATION. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1

CS152 Computer Architecture and Engineering SOLUTIONS Complex Pipelines, Out-of-Order Execution, and Speculation Problem Set #3 Due March 12

Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Complex Pipelines and Branch Prediction

Full Datapath. Chapter 4 The Processor 2

CS 152, Spring 2011 Section 8

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Limitations of Scalar Pipelines

E0-243: Computer Architecture

Wrong Path Events and Their Application to Early Misprediction Detection and Recovery

CS146 Computer Architecture. Fall Midterm Exam

Please state clearly any assumptions you make in solving the following problems.

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

Case Study IBM PowerPC 620

Good luck and have fun!

LaZy Superscalar. 1. Introduction

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas

Chapter 4. The Processor

Computer Architecture Lecture 13: State Maintenance and Recovery. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/15/2013

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

Computer Architecture EE 4720 Final Examination

Metodologie di Progettazione Hardware-Software

Static, multiple-issue (superscaler) pipelines

EECS 470 Midterm Exam Winter 2015

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution

Jim Keller. Digital Equipment Corp. Hudson MA

Wide Instruction Fetch

HANDLING MEMORY OPS. Dynamically Scheduling Memory Ops. Loads and Stores. Loads and Stores. Loads and Stores. Memory Forwarding

November 7, 2014 Prediction

LECTURE 3: THE PROCESSOR

Multithreaded Value Prediction

Data Speculation. Architecture. Carnegie Mellon School of Computer Science

15-740/ Computer Architecture Lecture 14: Runahead Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011

EECC551 Exam Review 4 questions out of 6 questions

EE557--FALL 2001 MIDTERM 2. Open books

Announcements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog)

18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II. Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012

Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors

Multithreaded Processors. Department of Electrical Engineering Stanford University

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

Current Microprocessors. Efficient Utilization of Hardware Blocks. Efficient Utilization of Hardware Blocks. Pipeline

Instruction-Level Parallelism and Its Exploitation

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )

The Processor: Instruction-Level Parallelism

Outline. In-Order vs. Out-of-Order. Project Goal 5/14/2007. Design and implement an out-of-ordering superscalar. Introduction

The Pentium II/III Processor Compiler on a Chip

15-740/ Computer Architecture Lecture 5: Precise Exceptions. Prof. Onur Mutlu Carnegie Mellon University

Transcription:

Dynamic Memory Dependence Predication Zhaoxiang Jin and Soner Önder ISCA-2018, Los Angeles

Background 1. Store instructions do not update the cache until they are retired (too late). 2. Store queue is implemented to keep the speculative store instructions before they are retired. 3. Load instructions need to associatively search the store queue to have an early execution.

Store Queue Design SW1 SW2 SW3 LW The addresses match. The store has to be older than the load. If there are multiple matching stores, the youngest store is selected. SW4 Store Queue

Store Queue Design SW1 SW2 LW SW3 The addresses are matching. The store has to be older than the load. If there are multiple matching stores, the youngest store is selected. Due to the hardware complexity, store queue does not scale well. SW4 Store Queue

Store-Queue-Free Architecture SW1 SW2 : SW $9(P10), 0 ($7) SW3 LW : LW $6, 0 ($8) Memory Dependence Prediction The DEF-store-load-USE dependence is collapsed to the DEF-USE. SW4 P10 (Memory Cloaking)

Store-Queue-Free Architecture SW1 SW2 : SW $9(P10), 0 ($7) If the memory dependence prediction is wrong, a misspeculation recovery is launched. SW3 LW : LW $6, 0 ($8) SW4 P10 (Memory Cloaking)

Store-Queue-Free Architecture SW1 SW2 : SW $9(P10), 0 ($7) If the memory dependence prediction is wrong, a misspeculation recovery is launched. If a load is frequently mispredicted, it is marked as a low confidence load. SW3 LW : LW $6, 0 ($8) SW4 P10 (Memory Cloaking)

Store-Queue-Free Architecture SW1 SW2 : SW $9(P10), 0 ($7) SW3 LW : LW $6, 0 ($8) If the memory dependence prediction is wrong, a misspeculation recovery is launched. If a load is frequently mispredicted, it is marked as a low confidence load. A low confidence load only gets its data from the cache. Therefore, it has to wait for the predicted store (SW2) retires and updates the cache (too late). SW4 P10 (Memory Cloaking)

Store-Queue-Free Architecture SW1 SW2 : SW $9(P10), 0 ($7) SW3 LW : LW $6, 0 ($8) SW4 P10 (Memory Cloaking) If the memory dependence prediction is wrong, a misspeculation recovery is launched. If a load is frequently mispredicted, it is marked as a low confidence load. A low confidence load only gets its data from the cache. Therefore, it has to wait for the predicted store (SW2) retires and updates the cache (too late). Low confidence loads are kept in a special buffer where they are selected to execute once their predicted stores retire.

Load instruction distribution 6.98% 14.27% 78.75% Direct access Read data from the cache. Bypassing Rename the destination register with the store register (memory cloaking). Delayed access Do not read the cache until the store is retired (low confidence).

Delayed access VS. Bypassing If the number is greater than zero, that means the delayed access load instructions take more cycles to execute. Average execution time comparison

Dynamic Memory Dependence Predication SW1 SW2 : SW $9, 0 ($7) DMDP SW3 ==? Data Cache LW : LW $6, 0 ($8) SW4

Dynamic Memory Dependence Predication SW1 SW2 : SW $9, 0 ($7) DMDP SW3 ==? Data Cache LW : LW $6, 0 ($8) SW4 CMP $32, $7, $8

Dynamic Memory Dependence Predication SW1 SW2 : SW $9, 0 ($7) $33 DMDP SW3 ==? Data Cache LW : LW $6, 0 ($8) SW4 CMP $32, $7, $8 LW $33, 0($8)

Dynamic Memory Dependence Predication SW1 SW2 : SW $9, 0 ($7) $33 DMDP SW3 ==? Data Cache LW : LW $6, 0 ($8) SW4 CMP $32, $7, $8 LW $33, 0($8) CMOV $6, $32, $9 CMOV $6,!$32, $33

Real dependence DMDP can also mispredict dependence SW1 SW2 : SW $9, 0 ($7) DMDP SW3 ==? Data Cache LW : LW $6, 0 ($8) SW4

DMDP can also mispredict dependence When a predicate is inserted for a load instruction: 1. The load depends on the predicted store 2. The load does not depend on any in-flight store 3. The load depends on a different store

Memory dependence prediction results over low confidence loads 7.7% 3.7% 88.6% IndepStore The load is independent of any in-flight store. DiffStore The load is dependent on a different store. Correct The prediction is right.

Memory dependence prediction results over low confidence loads 7.7% 3.7% 88.6% IndepStore The load is independent of any in-flight store. DiffStore The load is dependent on a different store. Correct The prediction is right. DMDP can cover IndepStore (88.6%) and Correct (7.7%), only DiffStore (3.7% low confidence loads) is not covered.

Simulation configuration ROB / RS / PRF 256 / 64 / 320 Fetch / Decode / Issue 8 / 8 / 8 Store Queue (baseline) Store Buffer Cache Memory Recovery penalty Int ALU / Int MUL Int DIV, FP ALU Branch predictor Tech node Clock frequency Unlimited entries, 4 cycles latency 16 entries, store coalesce 32kB 8-way set associative, 4 cycles hit latency, 2 read ports, 1 write port, il1, dl1; 512kB 8-way set associative, 10 cycles hit latency L2 16GB DDR3L-1600, 2 channels, 2 ranks, 8 banks, open page, up to 64 pending request Minimum 15 cycles 1 cycles / 3 cycles 7 cycles 8kB TAGE 22nm 3.2GHz 1. Baseline : a superscalar processor with unlimited store queue entries, using store-set to resolve memory dependence prediction. 2. NoSQ : the store-queue-free architecture. 3. DMDP : dynamic memory dependence predication. 4. Perfect : NoSQ + perfect memory dependence predictor.

Evaluation results 0.992 1.049 1.068 IPC normalized to the baseline

Evaluation results Average execution time for low confidence loads is significantly reduced (DMDP VS. NoSQ). hmmer 37.19 cycles -> 8.91 cycles wrf 61.59 cycles -> 12.78 cycles IPC normalized to the baseline

Evaluation results DMDP still encounters some memory dependence mispredictions in some benchmarks bzip2 1.409 MPKI hmmer 1.029 MPKI (Mispredictions Per 1k retired instructions) IPC normalized to the baseline

Energy Delay Product 0.933 1. Reducing the execution time. 2. Fewer memory dependence mispredictions means fewer misprediction recoveries. EDP normalized to NoSQ

Conclusion 1. DMDP is the first mechanism to use predication for memory dependence handling. 2. The storage for maintaining the low confidence loads is completely removed. 3. The memory dependence is translated to register dependence and will be checked in the reservation station.