Portland State University ECE 587/687. Memory Ordering


Portland State University ECE 587/687: Memory Ordering. Copyright by Alaa Alameldeen and Haitham Akkary, 2012.

Handling Memory Operations

- Review of the pipeline for out-of-order, superscalar processors
- To maximize ILP, loads and stores may execute out of order
- Memory operations require special handling
- Compare with register dependences, which are identified at decode time:
  - Allows early renaming to remove false dependences
  - Maximizes ILP
- Memory dependences cannot be determined before execution, since memory addresses must be computed first
- False dependences exist in the memory execution stream:
  - Multiple stores to the same bytes
  - Frequent due to stack pushes and pops

Portland State University ECE 587/687, Spring 2012

Handling Memory Operations (Cont.)

Functions a processor must perform for memory operations:
- Enforce memory dependences among loads and stores
- Stores/writes to memory must be ordered and non-speculative
  - Stores issue to the data cache after retirement
- Load and store ordering is enforced using load and store buffers, while still allowing out-of-order execution

Review Fig. 11 of Smith & Sohi's paper.

Store Buffer

- Store addresses are buffered in a queue
- Entries are allocated in program order at rename
- Store addresses remain buffered until:
  - The store data is available
  - The store instruction is retired from the reorder buffer
- New load addresses are checked against the waiting store addresses
- If there is a match, either:
  - The load waits, OR
  - The store data is bypassed to the matching load
- If there is no match, either:
  - The load executes speculatively, OR
  - The load waits until all prior store addresses are computed

Store Buffer (Cont.)

- To match loads with store addresses, the store buffer is typically organized as a fully associative queue
- An address comparator in each buffer entry compares its store address to the address of a load issued to the data cache
- Multiple stores to the same address may be present
  - A load-store address match is qualified with age information to match a load to the last preceding store in program order
  - Age is typically checked by attaching the number of the most recent preceding store entry to each load at rename
- Effectively provides a form of memory renaming
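The age-qualified associative search above can be sketched as follows. This is a simplified software model, not any particular processor's implementation; names like `StoreEntry` and `load_color` are invented, and for simplicity the load carries its own sequence number rather than the entry number of the preceding store.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StoreEntry:
    seq: int              # program-order sequence number (age)
    addr: Optional[int]   # None until the store address is computed
    data: Optional[int]   # None until the store data is available

class StoreBuffer:
    def __init__(self):
        self.entries = []  # allocated in program order at rename

    def allocate(self, seq):
        self.entries.append(StoreEntry(seq, None, None))

    def lookup(self, load_addr, load_color):
        """Associatively search for the youngest store that is older than
        the load (seq < load_color) and whose address matches.  Multiple
        matches may exist; age picks the last preceding store in program
        order, whose data can be bypassed to the load."""
        match = None
        for e in self.entries:
            if e.seq < load_color and e.addr == load_addr:
                if match is None or e.seq > match.seq:
                    match = e
        return match

sb = StoreBuffer()
sb.allocate(seq=1); sb.allocate(seq=3); sb.allocate(seq=7)
sb.entries[0].addr, sb.entries[0].data = 0x100, 11
sb.entries[1].addr, sb.entries[1].data = 0x100, 22  # later store, same bytes
m = sb.lookup(load_addr=0x100, load_color=5)
# The load (age 5) matches the store at seq 3 -- the last preceding
# store -- so its data (22), not 11, is bypassed to the load.
```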

Load Buffer

- Loads can speculatively issue to the data cache out of order
- But load issue may stall due to:
  - Unavailable resources/ports in the data cache
  - Unknown store addresses
  - Delayed store data execution
  - Memory-mapped I/O
  - Lock operations
  - Misaligned loads

Load Buffer (Cont.)

- A load buffer is provided for stalled loads to wait in
- Load entries are allocated in program order at rename
- A stalled load simply waits in its buffer entry until the stall condition is removed
- Scheduling logic checks for awakened loads and re-issues them to the data cache, typically in program order

Load Buffer (Cont.)

- Load buffers are sometimes used for load-store dependence speculation
- When a load is ready to issue ahead of a preceding store, the address match cannot yet be performed
- Option 1: Stall the load until all prior store addresses are computed
  - Significant performance impact in machines with deep, wide pipelines
- Option 2: Speculate that the load does not depend on the previous unknown stores
  - Needs a misprediction detection and recovery mechanism

Load Buffer (Cont.)

- One detection mechanism is to make the load buffer a fully associative queue
- Store addresses are checked against all previously issued load addresses
  - Only younger loads need to be checked
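The detection mechanism can be sketched as below: speculatively issued loads record their addresses in the load buffer, and each store, once its address is known, is checked associatively against all younger issued loads. The class and method names are invented for illustration.

```python
class LoadBuffer:
    def __init__(self):
        self.issued = []  # (seq, addr) of speculatively issued loads

    def record_issue(self, seq, addr):
        self.issued.append((seq, addr))

    def check_store(self, store_seq, store_addr):
        """Return the younger issued loads whose address matches this
        store -- each one read stale data before the store executed,
        i.e. a memory order violation requiring a pipeline flush.
        Only younger loads (seq > store_seq) need to be checked."""
        return [seq for (seq, addr) in self.issued
                if seq > store_seq and addr == store_addr]

lb = LoadBuffer()
lb.record_issue(seq=6, addr=0x200)  # issued ahead of the store at seq 4
lb.record_issue(seq=8, addr=0x300)
violations = lb.check_store(store_seq=4, store_addr=0x200)
# violations == [6]: load 6 must be squashed and re-executed
```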

Memory Consistency

- On some processors, the load buffer snoops other processors' stores to maintain memory consistency
- Stores from other threads on the same processor must also be snooped in the load buffer to maintain memory consistency
- Memory consistency models define the ordering requirements of loads and stores to different addresses and from multiple processors
  - More details in ECE 588/688
- Examples of memory consistency models:
  - Sequential Consistency (SC)
  - Total Store Order (TSO)

Memory Dependence Prediction

- Memory order violation: a load executes before a prior store to the same address and reads the wrong value
- False dependence: a load waits unnecessarily for stores to different addresses
- Goals of memory dependence prediction:
  - Predict which load instructions would cause a memory order violation
  - Delay execution of these loads only as long as necessary to avoid violations

Memory Dependence Prediction (Cont.)

- Memory misprediction recovery is expensive
  - Requires a pipeline flush if a store address matches a younger, already-issued load
  - Compare to branch mispredictions
- Memory dependence prediction minimizes such mispredictions
  - Issue younger loads if the predictor predicts no dependence
  - Stall younger loads if the predictor predicts a dependence
- With a good predictor, this approach minimizes unnecessary stalls (false dependences) as well as pipeline flushes from mispredictions
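The issue-or-stall decision can be sketched with a minimal PC-indexed predictor. This is a generic illustration, not a design from the assigned paper: the table size, direct indexing, and 2-bit saturating counters are all assumptions.

```python
class DependencePredictor:
    def __init__(self, size=4096):
        self.table = [0] * size  # 2-bit saturating counters, 0..3

    def _idx(self, load_pc):
        return load_pc % len(self.table)

    def predict_dependent(self, load_pc):
        # Counter >= 2 means "this load has conflicted before":
        # stall it behind unresolved stores.  Otherwise issue it
        # speculatively.
        return self.table[self._idx(load_pc)] >= 2

    def train(self, load_pc, violated):
        # Strengthen on a memory order violation, weaken otherwise.
        i = self._idx(load_pc)
        if violated:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

p = DependencePredictor()
first = p.predict_dependent(0x40)   # False: issue speculatively at first
p.train(0x40, violated=True)
p.train(0x40, violated=True)
later = p.predict_dependent(0x40)   # True: now stall this load
```

The saturating counter lets a load re-earn speculative issue after its conflicting store pattern goes away, limiting false dependences.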

Simple Alternatives to Memory Dependence Prediction

- No speculation: a load's issue waits until all prior stores have issued
- Naïve speculation: always issue and execute loads when their register dependences are satisfied, regardless of memory dependences
- Perfect memory dependence prediction: causes no memory order violations and avoids all false dependences
- Simulation results in Table 3.1 and Fig. 3.1

Store Sets

- Based on the assumption that future dependences can be predicted from past behavior
- Each load has a store set consisting of all stores upon which it has ever depended
  - A store is identified by its PC
- When the program starts, all loads have empty store sets
- When a memory order violation happens, the store is added to the load's store set
- Example 5.1 in the paper; results in Fig. 5.1 and Fig. 5.2
- Discuss the 2-bit saturating counter (Fig. 5.3, 5.4)

Store Set Implementation

- Figure 6.1 shows the basic components:
  - Store Set Identifier Table (SSIT): PC-indexed, maintains the store sets
  - Last Fetched Store Table (LFST): maintains the dynamic instruction count of the most recently fetched store for each store set
- Limitations:
  - A store PC can exist in only one store set at a time
  - Two loads depending on the same store share a store set
  - All stores in a store set are executed in order

Implementation Details

- Recently fetched loads:
  - Access the SSIT based on their PC to get their SSID
  - If the SSID is valid, the LFST is accessed to get the most recent store in the load's store set
- Recently fetched stores:
  - Access the SSIT based on their PC
  - If the SSID is valid, the store belongs to a valid store set:
    - Access the LFST to get the most recently fetched store in its store set
    - Update the LFST with the store's own dynamic instruction count, since it is now the last fetched store in that store set
- After a store issues, it invalidates the LFST entry if the entry still refers to it, ensuring loads and stores depend only on stores that haven't yet issued
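The SSIT/LFST access paths above can be sketched as follows. This is a simplified model: the table size and direct PC-indexing are illustrative assumptions, and the LFST is modeled as a dictionary keyed by SSID.

```python
SSIT_SIZE = 1024  # illustrative size

class StoreSets:
    def __init__(self):
        self.ssit = [None] * SSIT_SIZE  # PC-indexed store set IDs (SSIDs)
        self.lfst = {}  # SSID -> dynamic inst. count of last fetched store

    def fetch_load(self, load_pc):
        """On fetching a load: read its SSID; if valid, the LFST gives
        the most recently fetched store it must wait for (None if the
        set is empty or that store has already issued)."""
        ssid = self.ssit[load_pc % SSIT_SIZE]
        return self.lfst.get(ssid) if ssid is not None else None

    def fetch_store(self, store_pc, dyn_count):
        """On fetching a store with a valid SSID: read the previous last
        fetched store in its set (stores in a set execute in order),
        then record itself as the new last fetched store."""
        ssid = self.ssit[store_pc % SSIT_SIZE]
        if ssid is None:
            return None
        prev = self.lfst.get(ssid)
        self.lfst[ssid] = dyn_count
        return prev

    def issue_store(self, store_pc, dyn_count):
        """After the store issues, invalidate its LFST entry if it still
        refers to this store, so later loads and stores only wait on
        stores that haven't issued."""
        ssid = self.ssit[store_pc % SSIT_SIZE]
        if ssid is not None and self.lfst.get(ssid) == dyn_count:
            del self.lfst[ssid]

ss = StoreSets()
ss.ssit[0x10] = 5  # load at PC 0x10 and store at PC 0x20 share SSID 5
ss.ssit[0x20] = 5
ss.fetch_store(0x20, dyn_count=42)
dep = ss.fetch_load(0x10)      # the load must wait for dynamic store 42
ss.issue_store(0x20, dyn_count=42)
free = ss.fetch_load(0x10)     # store has issued: no dependence (None)
```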

Store Set Assignment

- Destructive interference happens because stores can belong to only one store set
- Store set merging avoids the problem
- When a store-load pair causes a memory order violation:
  - If neither has been assigned a store set, a new store set is allocated and assigned to both instructions
  - If the load has been assigned a store set but the store hasn't, the store is assigned the load's store set
  - If the store has been assigned a store set but the load hasn't, the load is assigned the store's store set
  - If both have store sets, one of them is declared the winner, and the instruction belonging to the loser's store set is assigned the winner's store set
- Results for store set merging in Fig. 6.2
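The four assignment rules can be sketched directly. The SSIT is modeled as a dictionary here, and choosing the smaller SSID as the "winner" is an arbitrary illustrative policy; the paper leaves the winner selection to the implementation.

```python
def assign_store_set(ssit, load_pc, store_pc, next_ssid):
    """Apply the store set assignment rules after a store-load pair
    causes a memory order violation."""
    load_ss = ssit.get(load_pc)
    store_ss = ssit.get(store_pc)
    if load_ss is None and store_ss is None:
        ssit[load_pc] = ssit[store_pc] = next_ssid  # allocate a new set
    elif store_ss is None:
        ssit[store_pc] = load_ss    # store joins the load's set
    elif load_ss is None:
        ssit[load_pc] = store_ss    # load joins the store's set
    else:
        winner = min(load_ss, store_ss)  # merge: loser's pair member
        ssit[load_pc] = ssit[store_pc] = winner  # moves to winner's set

ssit = {}
assign_store_set(ssit, load_pc=0xA, store_pc=0xB, next_ssid=1)
# both instructions now share store set 1
assign_store_set(ssit, load_pc=0xC, store_pc=0xB, next_ssid=2)
# the new load joins the store's existing set 1 (SSID 2 goes unused)
```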

Reading Assignments

Tuesday:
- D. Kroft, "Lockup-Free Instruction Fetch/Prefetch Cache Organization," ISCA 1981 (Read)
- M.D. Hill and A.J. Smith, "Evaluating Associativity in CPU Caches," IEEE Transactions on Computers, 1989 (Skim)

Thursday:
- Norman P. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," ISCA 1990 (Review)
- Tien-Fu Chen and Jean-Loup Baer, "Effective Hardware-Based Data Prefetching for High-Performance Processors," IEEE Transactions on Computers, 1995 (Skim)

Midterm exam on Tuesday, May 8th