HANDLING MEMORY OPS: Dynamically Scheduling Memory Ops, Loads and Stores, Memory Forwarding


Dynamically Scheduling Memory Ops

Compilers must schedule memory ops conservatively. Hardware has three options:
- Conservative: hold loads until all prior stores execute.
- Aggressive: execute loads as soon as possible and detect violations. When a store executes, it checks whether any later load to the same address executed too early; if so, flush the pipeline.
- Predictive: learn violations over time and selectively reorder.

Example ("Before" vs. "Wrong(?)"): a load through a computed register follows a store through the stack pointer. The compiler must leave the load after the store and eat the stall, because it cannot tell whether the computed address aliases the stack slot; hoisting the load above the store is only correct if the addresses differ.

Loads and Stores

Instruction sequence: a long-latency fdiv, then two stores whose data depends on the fdiv, then a load. Can the load execute before the stores? It is the aliasing question again:
- If the load's address equals an older store's address, the load must see that store's value, so it cannot simply read the cache early.
- If the load's address provably differs from every older store's address, the load can safely execute before them.

Memory Forwarding

Stores write the cache at commit. Commit is in-order, delayed behind all older instructions; this allows stores to be undone on branch mis-predictions, etc. Loads read the cache, and early execution of loads is critical.

Forwarding allows store → load communication before the store commits. It is conceptually like register bypassing, but with a different implementation. Why? Addresses are unknown until execute.
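The forwarding requirement can be stated as a rule: a load must observe the youngest older store to its address, or memory if no such store exists. A minimal Python sketch of that rule (hypothetical helper name and illustrative addresses, not any particular machine's implementation):

```python
def value_for_load(load_addr, older_stores, memory):
    """Value a load must observe: the youngest older store to the same
    address, if any; otherwise the value sitting in memory.
    older_stores is in program order (oldest first)."""
    for addr, data in reversed(older_stores):  # scan youngest-first
        if addr == load_addr:
            return data
    return memory[load_addr]

mem = {0x100: 7, 0x200: 9, 0x300: 5}
stores = [(0x100, 1), (0x200, 2), (0x100, 3)]  # program order
assert value_for_load(0x100, stores, mem) == 3  # youngest store to 0x100
assert value_for_load(0x200, stores, mem) == 2
assert value_for_load(0x300, stores, mem) == 5  # no match: read memory
```

The store queue described next is a hardware realization of exactly this youngest-older-match search.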

Forwarding: Store Queue

The store queue (SQ) holds all in-flight stores. It is a CAM, searchable by address, with age logic that determines the youngest matching store older than a given load.
- Store execution: write address + data into the store queue.
- Load execution: search the SQ by address. Match? Forward the store's data. No match? Read the data cache.

(Figure: store queue with head/tail pointers, address CAM searched by load position, age logic, and data paths into the data cache.)

Load Scheduling

- Store → load forwarding: get the value from an executed (but not committed) store to a load.
- Load scheduling: determine when a load can execute with regard to older stores.

Conservative load scheduling: a load issues only after all older stores have executed. Some architectures split store address from store data; then only the address needs to be known.
+ Advantage: always safe.
- Disadvantage: performance (limits out-of-orderness).

Example: a loop body of two loads, an add, and a store, followed by loop control, with a second iteration behind it (shown before and after renaming to physical registers). With conservative load scheduling, what can go out of order? A multi-wide machine issuing one load per cycle dispatches the instructions, and then the question on every frame is: why don't we issue the next load?
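The CAM search plus age logic can be sketched as follows (Python for illustration; ages stand in for queue position, and a plain list stands in for the real head/tail circular buffer):

```python
class StoreQueue:
    """Sketch of a store queue: in-flight stores, searched
    associatively by address when a load executes."""
    def __init__(self):
        self.entries = []  # (age, addr, data)

    def store_execute(self, age, addr, data):
        self.entries.append((age, addr, data))

    def load_execute(self, load_age, addr, dcache):
        # CAM search + age logic: youngest matching store older than the load
        matches = [e for e in self.entries if e[1] == addr and e[0] < load_age]
        if matches:
            return max(matches)[2]   # forward from the youngest older match
        return dcache[addr]          # no match: read the data cache

sq = StoreQueue()
dcache = {0x40: 111, 0x80: 9}
sq.store_execute(age=1, addr=0x40, data=222)
sq.store_execute(age=3, addr=0x40, data=333)
assert sq.load_execute(load_age=5, addr=0x40, dcache=dcache) == 333
assert sq.load_execute(load_age=2, addr=0x40, dcache=dcache) == 222  # age logic
assert sq.load_execute(load_age=5, addr=0x80, dcache=dcache) == 9    # cache hit
```

Note how the second assertion exercises the age logic: a load of age 2 may only see the age-1 store, even though a younger matching store exists.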

The walkthrough crawls forward cycle by cycle: each load must wait for every older store to execute, the adds wait on their loads, and the stores wait on their adds, so only an instruction or two completes per frame ("Finally some action!", "Getting somewhere.", "Etc...", "Yawn").

After still more cycles ("Stretch", "Zzzzzz"), the punchline: under conservative load scheduling, the multi-wide out-of-order machine performs like a multi-wide in-order machine on this loop. I am going to cry.

What were the loads waiting for? Older stores that, in the end, never conflicted with them. Can I speculate?

Load Speculation

Speculation requires two things:
1. Detection of mis-speculations. How can we do this?
2. Recovery from mis-speculations: squash from the offending load. We saw how to squash from branches; the same method applies here.

Load Queue (LQ)

The load queue detects load-ordering violations.
- Execute a load: write its address into the LQ at the load's position; also note any store it forwarded from.
- Execute a store: search the LQ. Is there a younger load with the same address that didn't forward from a store younger than this one? If so, flush.

(Figure: load queue with head/tail pointers and age logic, alongside the store queue and data cache.)
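The load-queue check can be sketched as follows (Python, illustrative: `forwarded_from` records the age of the store a load forwarded from, if any):

```python
class LoadQueue:
    """Sketch of load-queue violation detection: when a store executes,
    any already-executed *younger* load to the same address that did not
    forward from an even younger store read a stale value -> squash."""
    def __init__(self):
        self.entries = []  # (load_age, addr, forwarded_from_age or None)

    def load_execute(self, age, addr, forwarded_from=None):
        self.entries.append((age, addr, forwarded_from))

    def store_execute(self, store_age, addr):
        for load_age, load_addr, fwd in self.entries:
            if (load_age > store_age and load_addr == addr
                    and (fwd is None or fwd < store_age)):
                return load_age   # ordering violation: squash from this load
        return None               # no violation

lq = LoadQueue()
lq.load_execute(age=4, addr=0x40)                 # ran early, read the cache
assert lq.store_execute(store_age=2, addr=0x40) == 4     # violation: squash
assert lq.store_execute(store_age=2, addr=0x80) is None  # different address

lq2 = LoadQueue()
lq2.load_execute(age=6, addr=0x40, forwarded_from=5)     # forwarded from store 5
assert lq2.store_execute(store_age=2, addr=0x40) is None # store 5 is younger: OK
```

The last case is the "didn't forward from a younger store" clause: the load's value came from a store between the executing store and the load, so it is still correct.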

Store Queue + Load Queue

Store queue: handles forwarding. Written by stores (at execute); searched by loads (at execute); read when the store writes the data cache (at commit).
Load queue: detects ordering violations. Written by loads (at execute); searched by stores (at execute).
Together they allow aggressive load scheduling: stores don't constrain load execution.

Re-running the loop with aggressive scheduling (same multi-wide machine, one load per cycle): the later loads speculatively execute before the older store, and fast-forwarding to the end, the loop finishes several cycles sooner. Actually out-of-order this time!

Aggressive Load Scheduling

Aggressive scheduling allows loads to issue before older stores, which increases out-of-orderness.
+ When there is no conflict, performance increases.
- On a conflict, the squash costs more than waiting would have.
Some loads really do forward from stores, so an always-aggressive policy will squash a lot. Can we have our cake AND eat it too?

Predictive Load Scheduling

Predict which loads must wait for stores. Fool me once, shame on you; fool me twice?
- Loads default to aggressive.
- Keep a table of load PCs that have caused squashes; schedule those conservatively.
+ Simple predictor.
- Makes "bad" loads wait for all older stores: not great.
More complex predictors are used in practice: they predict which stores a load should wait for.
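The simple PC-indexed predictor described above might look like this (Python sketch; names and the bare set are illustrative, real tables are finite and tagged):

```python
class LoadScheduler:
    """Sketch of predictive load scheduling: loads default to aggressive;
    a table of load PCs that have caused squashes is scheduled
    conservatively (wait for all older stores) from then on."""
    def __init__(self):
        self.bad_load_pcs = set()   # loads that caused an ordering squash

    def may_issue(self, load_pc, all_older_stores_executed):
        if load_pc in self.bad_load_pcs:
            return all_older_stores_executed  # conservative for known-bad loads
        return True                           # aggressive by default

    def record_squash(self, load_pc):
        self.bad_load_pcs.add(load_pc)        # fool me once...

sched = LoadScheduler()
assert sched.may_issue(0x1000, all_older_stores_executed=False)       # aggressive
sched.record_squash(0x1000)   # this load violated ordering once
assert not sched.may_issue(0x1000, all_older_stores_executed=False)   # now waits
assert sched.may_issue(0x1000, all_older_stores_executed=True)
```

The "more complex predictors" mentioned above refine the conservative case: instead of waiting for all older stores, the load waits only for the particular store(s) it has conflicted with before.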

Out of Order: Window Size

Scheduling scope = out-of-order window size; larger is better. The window is constrained by:
- Physical registers (#preg): the ROB is roughly limited by #preg = ROB size + #logical registers, and a big register file is hard and slow.
- The issue queue, which limits the number of un-executed instructions; it is a CAM, so it can't be made big (power + area).
- The load and store queues, which limit the number of in-flight loads and stores; also CAMs.
Scaling window sizes is an active area of research. The usefulness of a large window is limited by branch prediction: a 9% branch mis-prediction rate means a misprediction roughly every 10 branches, i.e. on the order of every 100 instructions.

Out of Order: Benefits

- Allows speculative re-ordering: of loads and stores, and past predicted branches.
- The schedule can change in response to cache misses: a different schedule is optimal than on a cache hit.
- Done by hardware: a compiler would want a different schedule for each hardware configuration, but the hardware only has its own configuration to deal with.

Static vs. Dynamic Scheduling

If we can do this in software, why build complex (slow-clock, high-power) hardware?
+ Performance portability: don't want to recompile for new machines.
+ More information available: memory addresses, branch directions, cache misses.
+ More registers available: the compiler may not have enough to schedule well.
+ Speculative memory-operation re-ordering: the compiler must be conservative; hardware can speculate.
But the compiler has a larger scope: the compiler does as much as it can (not much), and the hardware does the rest.

Out of Order: Top Things to Know

- Register renaming: how to perform it and how to recover it.
- Commit: precise state (ROB); how and when registers are freed.
- Issue/select: wakeup via CAM; choose the N oldest ready instructions.
- Stores: write the cache at commit; forward to loads via the SQ.
- Loads: conservative/aggressive/predictive scheduling; violation detection via the LQ.

Summary: Dynamic Scheduling

Dynamic scheduling is done totally in hardware; it is also called out-of-order execution (OoO).
- Fetch many instructions into the instruction window.
- Use branch prediction to speculate past (multiple) branches; flush the pipeline on a branch misprediction.
- Rename to avoid false dependencies.
- Execute instructions as soon as possible: register dependencies are known; handling memory dependencies is trickier.
- Commit instructions in order: if anything strange happens pre-commit, just flush the pipeline.
Current machines have 100+ instruction scheduling windows.