Portland State University ECE 587/687. Memory Ordering


Portland State University ECE 587/687: Memory Ordering
Copyright by Alaa Alameldeen, Zeshan Chishti and Haitham Akkary, 2018

Handling Memory Operations
- Review the pipeline for out-of-order, superscalar processors
- To maximize ILP, loads/stores may execute out of order
- Memory operations require special handling
- Compare: register dependences are identified at decode time
  - Allows early renaming to remove false dependences
  - Maximizes ILP
- Memory dependences cannot be determined before execution, since memory addresses must be computed first
- False dependences exist in the memory execution stream
  - Multiple stores to the same bytes
  - Frequent due to stack pushes and pops
Portland State University ECE 587/687, Fall 2018

Handling Memory Operations (Cont.)
- Functions a processor must perform for memory operations:
  - Enforcing memory dependences among loads and stores
- Stores/writes to memory must be ordered and non-speculative
  - Stores issue to the data cache after retirement
- Load and store ordering is enforced using load and store buffers while allowing out-of-order execution
- Review Fig. 11 of Smith & Sohi's paper

Store Buffer
- Store addresses are buffered in a queue
  - Entries are allocated in program order at rename
- Store addresses remain buffered until:
  - Store data is available
  - The store instruction is retired in the reorder buffer
- New load addresses are checked against the waiting store addresses
- If there is a match:
  - The load waits, OR
  - Store data is bypassed to the matching load
- If there is no match:
  - The load can execute speculatively, OR
  - The load waits till all prior store addresses are computed
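The match-or-bypass decision a load makes against the store buffer can be sketched as below. This is a minimal illustration, not a hardware description: `StoreEntry` and `check_load` are hypothetical names, and the sketch implements the conservative no-match policy (stall the load while any prior store address is unknown).

```python
# Hypothetical sketch of the store-buffer check a load performs at issue.
from dataclasses import dataclass
from typing import Optional

@dataclass
class StoreEntry:
    addr: Optional[int]   # None until the store address is computed
    data: Optional[int]   # None until the store data is available

def check_load(store_buffer, load_addr):
    """Return ('forward', data), ('wait', None), or ('issue', None).
    store_buffer holds the stores older than this load, in program order."""
    for entry in reversed(store_buffer):        # youngest prior store first
        if entry.addr is None:
            return ('wait', None)               # unknown address: conservative stall
        if entry.addr == load_addr:
            if entry.data is not None:
                return ('forward', entry.data)  # bypass store data to the load
            return ('wait', None)               # address match but data not ready
    return ('issue', None)                      # no match: load goes to the cache

buf = [StoreEntry(0x100, 7), StoreEntry(0x200, None)]
print(check_load(buf, 0x100))                   # forwards the value 7
```

Scanning from the youngest prior store downward means an address match always finds the last preceding store, which is the forwarding behavior the next slide's age qualification enforces in hardware.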

Store Buffer (Cont.)
- To match loads with store addresses, the store buffer is typically organized as a fully-associative queue
  - An address comparator in each buffer entry compares each store address to the address of a load issued to the data cache
- Multiple stores to the same address may be present
  - The load-store address match is qualified with age information to match a load to the last preceding store in program order
  - Age is typically checked by attaching the number of the preceding store entry to each load at rename
- Effectively provides a form of memory renaming
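The age qualification can be illustrated as follows: the load carries the index of the youngest store older than it (recorded at rename), and only entries at or before that index are match candidates. The function name and list representation are assumptions for illustration.

```python
# Illustrative sketch of age-qualified store-buffer matching.
def youngest_matching_store(store_addrs, load_addr, last_prior_store):
    """store_addrs: store addresses in program (allocation) order.
    last_prior_store: index of the youngest store preceding the load,
    attached to the load at rename. Returns the index of the store to
    forward from, or None if no older store matches."""
    for i in range(last_prior_store, -1, -1):   # scan youngest prior store first
        if store_addrs[i] == load_addr:
            return i                            # last preceding store wins
    return None

addrs = [0x40, 0x80, 0x40, 0x90]
# Load younger than store 2: matches the newer of the two 0x40 stores.
print(youngest_matching_store(addrs, 0x40, 2))
# Load older than store 2: store 2 is excluded, store 0 matches instead.
print(youngest_matching_store(addrs, 0x40, 1))
```

Because different loads can bind to different dynamic stores to the same address, this behaves like renaming for memory locations, as the slide notes.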

Load Buffer
- Loads can speculatively issue to the data cache out of order
- But load issue may stall due to:
  - Unavailable resources/ports in the data cache
  - Unknown store addresses
  - Delayed store data execution
  - Memory-mapped I/O
  - Lock operations
  - Misaligned loads

Load Buffer (Cont.)
- A load buffer is provided for stalled loads to wait in
  - Load entries are allocated in program order at rename
- A stalled load simply waits in its buffer entry until the stall condition is removed
- Scheduling logic checks for awakened loads and re-issues them to the data cache, typically in program order
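A scheduler pass over such a buffer might look like the sketch below; the entry fields and function name are hypothetical, and a real scheduler would clear the `stalled` flag when the specific stall condition (port conflict, unknown store address, etc.) resolves.

```python
# Minimal sketch of re-issuing awakened loads from a load buffer.
from collections import deque

def reissue_awakened(load_buffer):
    """One scheduler pass: re-issue loads whose stall cleared, oldest first."""
    issued = []
    for entry in list(load_buffer):        # iterate in allocation (program) order
        if not entry['stalled']:
            issued.append(entry['pc'])
            load_buffer.remove(entry)      # entry is freed once the load issues
    return issued

buf = deque([{'pc': 0x10, 'stalled': True},    # still waiting
             {'pc': 0x14, 'stalled': False}])  # awakened, ready to re-issue
print(reissue_awakened(buf))               # only the 0x14 load issues
```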

Load Buffer (Cont.)
- Load buffers are sometimes used for load-store dependence speculation
- When a load issues ahead of a preceding store, it is impossible to perform the address match
- Option 1: Stall the load until all prior store addresses are computed
  - Significant performance impact in machines with deep, wide pipelines
- Option 2: Speculate that the load does not depend on the previous unknown stores
  - Needs a misprediction detection and recovery mechanism

Memory Dependence Prediction
- Memory order violation: a load executes before a prior store to the same address and reads the wrong value
- False dependence: loads wait unnecessarily for stores to different addresses
- Goals of memory dependence prediction:
  - Predict the load instructions that would cause a memory order violation
  - Delay execution of these loads only as long as necessary to avoid violations

Memory Dependence Prediction (Cont.)
- Memory misprediction recovery is expensive
  - Requires a pipeline flush if a store address matches a younger, already-issued load
  - Compare to branch mispredictions
- Memory dependence prediction minimizes such mispredictions
  - Issue younger loads if the predictor predicts no dependence
  - Stall younger loads if the predictor predicts a dependence
- With a good predictor, this approach minimizes unnecessary stalls (false dependences) as well as pipeline flushes from mispredictions
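The issue-gating policy can be sketched with a toy predictor; the dict keyed by load PC is a stand-in assumption, not the predictor the paper proposes (store sets, covered next, are one real design).

```python
# Hedged sketch of how a dependence prediction gates load issue.
predictor = {}    # load PC -> True if a past violation was observed

def may_issue(load_pc, all_prior_store_addrs_known):
    """Decide whether a load may issue to the data cache this cycle."""
    if predictor.get(load_pc, False):          # predicted dependent:
        return all_prior_store_addrs_known     # stall until stores resolve
    return True                                # predicted independent: speculate

def on_violation(load_pc):
    """Called on a memory order violation: flush the pipeline, train predictor."""
    predictor[load_pc] = True
```

A first-encounter load speculates freely; after one violation it stalls behind unresolved stores, trading a cheap stall for an expensive flush.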

Simple Alternatives to Memory Dependence Prediction
- No speculation: a load issues only after all prior stores have issued
- Naïve speculation: always issue and execute loads when their register dependences are satisfied, regardless of memory dependences
- Perfect memory dependence prediction:
  - Does not cause memory order violations
  - Avoids all false dependences
- Simulation results in Table 3.1, Fig. 3.1

Store Sets
- Based on the assumption that future dependences can be predicted from past behavior
- Each load has a store set consisting of all stores upon which it has ever depended
  - A store is identified by its PC
- When the program starts, all loads have empty store sets
- When a memory order violation happens, the store is added to the load's store set
- Example 5.1 in the paper
- Results in Fig. 5.1 and Fig. 5.2
- Discuss 2-bit saturating counters (Fig. 5.3, 5.4)
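The training rule above amounts to growing a per-load set of store PCs on each violation; a toy sketch (illustrative names, not the paper's table-based implementation):

```python
# Toy illustration of building store sets from observed violations.
store_sets = {}    # load PC -> set of store PCs it has ever depended on

def on_memory_order_violation(load_pc, store_pc):
    """The violating store (identified by PC) joins the load's store set."""
    store_sets.setdefault(load_pc, set()).add(store_pc)

on_memory_order_violation(0x400, 0x380)    # load at 0x400 violated with store 0x380
on_memory_order_violation(0x400, 0x390)    # later, with a different store
```

All loads start with empty store sets, so the structure only learns dependences the program has actually exhibited.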

Store Set Implementation
- Figure 6.1 shows the basic components
  - Store Set Identifier Table (SSIT): PC-indexed, maintains the store sets
  - Last Fetched Store Table (LFST): maintains the dynamic instruction count of the most recently fetched store for each store set
- Limitations:
  - A store's PC can exist in only one store set at a time
  - Two loads depending on the same store can share a store set
  - All stores in a store set are executed in order

Implementation Details
- Recently fetched loads:
  - Access the SSIT based on their PC to get their SSID
  - If the SSID is valid, the LFST is accessed to get the most recent store in the load's store set
- Recently fetched stores:
  - Access the SSIT based on their PC
  - If the SSID is valid, the store belongs to a valid store set
    - Access the LFST to get the most recently fetched store in its store set
    - Update the LFST with its own dynamic instruction count, since it is now the last fetched store in that store set
- After a store issues, it invalidates the LFST entry if the entry refers to itself, to ensure loads and stores only depend on stores that haven't been issued
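The SSIT/LFST access pattern just described can be sketched as below. Dicts stand in for the PC-indexed tables, and the function names are assumptions for illustration; real tables are finite and indexed by hashed PC bits.

```python
# Illustrative sketch of the SSIT/LFST access pattern.
SSIT = {}    # instruction PC -> SSID (store set identifier); valid if present
LFST = {}    # SSID -> dynamic inst. number of the most recently fetched store

def fetch_load(pc):
    """Return the dynamic number of the store this load must wait for, or None."""
    ssid = SSIT.get(pc)
    if ssid is None:
        return None                 # no store set: load is free to issue
    return LFST.get(ssid)           # most recent unissued store in the set, if any

def fetch_store(pc, dyn_num):
    """A fetched store waits on the previous store in its set (keeping the
    set's stores in order), then becomes the set's last fetched store."""
    ssid = SSIT.get(pc)
    if ssid is None:
        return None
    prev = LFST.get(ssid)
    LFST[ssid] = dyn_num            # this store is now the last fetched
    return prev

def issue_store(pc, dyn_num):
    """On issue, invalidate the LFST entry if it still refers to this store."""
    ssid = SSIT.get(pc)
    if ssid is not None and LFST.get(ssid) == dyn_num:
        del LFST[ssid]

# Example: a store at PC 0x40 and a load at PC 0x80 share store set 1.
SSIT[0x40] = 1
SSIT[0x80] = 1
fetch_store(0x40, 7)     # store #7 becomes the last fetched store of set 1
dep = fetch_load(0x80)   # the load learns it must wait for store #7
issue_store(0x40, 7)     # once issued, the LFST entry is cleared
```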

Store Set Assignment
- Destructive interference happens because stores can belong to only one store set
  - Store set merging avoids the problem
- When a store-load pair causes a memory order violation:
  - If neither has been assigned a store set, a store set is allocated and assigned to both instructions
  - If the load has been assigned a store set but the store hasn't, the store is assigned the load's store set
  - If the store has been assigned a store set but the load hasn't, the load is assigned the store's store set
  - If both have store sets, one of them is declared the winner, and the instruction belonging to the loser's store set is assigned the winner's store set
- Results for store set merging in Fig. 6.2
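The four assignment rules above can be sketched as one function. The winner rule used here (smaller SSID wins) is an assumption for illustration; only the case structure follows the slide.

```python
# Sketch of the store-set assignment/merging rules.
store_set_of = {}   # instruction PC -> store set id (SSID)
_next_ssid = 0

def assign_on_violation(load_pc, store_pc):
    """Apply the store-set assignment rules when this pair violates."""
    global _next_ssid
    l = store_set_of.get(load_pc)
    s = store_set_of.get(store_pc)
    if l is None and s is None:           # neither assigned: allocate a new set
        store_set_of[load_pc] = store_set_of[store_pc] = _next_ssid
        _next_ssid += 1
    elif s is None:                       # store joins the load's set
        store_set_of[store_pc] = l
    elif l is None:                       # load joins the store's set
        store_set_of[load_pc] = s
    else:                                 # both assigned: the loser's instruction
        winner = min(l, s)                # moves to the winner's set
        store_set_of[load_pc] = store_set_of[store_pc] = winner
```

Because the losing instruction is reassigned rather than evicted outright, repeatedly conflicting loads and stores converge into a shared set instead of thrashing each other's predictions.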

Reading Assignment
- Reading assignment for next class:
  - D. Kroft, "Lockup-Free Instruction Fetch/Prefetch Cache Organization," ISCA 1981 (Read)
  - M.D. Hill and A.J. Smith, "Evaluating Associativity in CPU Caches," IEEE Transactions on Computers, 1989 (Read)