CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars


CS 152 Computer Architecture and Engineering
Lecture 12 - Advanced Out-of-Order Superscalars
Dr. George Michelogiannakis
EECS, University of California at Berkeley
CRD, Lawrence Berkeley National Laboratory
http://inst.eecs.berkeley.edu/~cs152

Last time in Lecture 11
- Register renaming removes WAR and WAW hazards
- In-order fetch/decode, out-of-order execute, in-order commit gives high performance and precise exceptions
- Need to rapidly recover on branch mispredictions
- Unified physical register file machines remove data values from the ROB
  - All values are read and written only during execution
  - Only register tags are held in the ROB

Question of the Day
How many in-order cores do you think take up the same area as an out-of-order core?

Reminder: Out-of-Order Pipeline
[Diagram: in-order PC generation, Fetch, and Decode & Rename feed the Reorder Buffer and in-order Commit; out-of-order Execute comprises the Branch Unit, ALU, and MEM (with Store Buffer and D$), all reading the unified Physical Reg. File]

Separate Pending Instruction Window from ROB
- The instruction window holds instructions that have been decoded and renamed but not issued into execution. Each entry has register tags, presence bits, and a pointer to its ROB entry.
- The reorder buffer is used to hold exception information for commit. Instructions that have committed are no longer in the instruction window.
- Instruction window entry: use | ex | op | p1 | PR1 | p2 | PR2 | PRd | ROB#
- ROB entry: Done? | Rd | LPRd | PC | Except?  (Ptr 1 = next available, Ptr 2 = next to commit)
- The ROB is usually several times larger than the instruction window. Why?

Reorder Buffer Holds Active Instructions (Decoded but not Committed)
(Older instructions at the top, newer at the bottom; the Commit, Execute, and Fetch pointers each advance between cycle t and cycle t+1)
ld x1, (x3)
add x3, x1, x2
sub x6, x7, x9
add x3, x3, x6
ld x6, (x1)
add x6, x6, x3
sd x6, (x1)
ld x6, (x1)

Issue Timing
i1: Add R1, R1, #1  ->  Issue 1, Execute 1
i2: Sub R1, R1, #1  ->  Issue 2, Execute 2
How can we issue earlier? By using knowledge of execution latency (bypass): issue the dependent instruction so it executes just as its operand becomes available on the bypass network.
What makes this schedule fail? The execution latency not being as expected (e.g., a cache miss).

Issue Queue with Latency Prediction
Entry fields: Inst# | use | exec | op | p1 | lat1 | src1 | p2 | lat2 | src2 | dest
(Ptr 1 = next available, Ptr 2 = next to commit; instructions beyond a BEQZ are speculative)
- Fixed latency: latency included in queue entry (bypassed)
- Predicted latency: latency included in queue entry (speculated)
- Variable latency: wait for completion signal (stall)

Improving Instruction Fetch
- Performance of speculative out-of-order machines is often limited by instruction fetch bandwidth
- Speculative execution can fetch 2-3x more instructions than are committed
- Mispredict penalties are dominated by the time to refill the instruction window
- Taken branches are particularly troublesome

Increasing Taken Branch Bandwidth (Alpha 21264 I-Cache)
[Diagram: a fast fetch path through PC Generation, Line Predict, and Way Predict delivers 4 instructions per cycle; Branch Prediction, Instruction Decode, Validity Checks, and the 2-way tag compare (Hit/Miss/Way) sit off the critical path]
- Fold 2-way tags and BTB into the predicted next block
- Take tag checks, instruction decode, and branch prediction out of the critical loop
- Raw RAM speed on the critical loop (1 cycle at ~1 GHz)
- A 2-bit hysteresis counter per block prevents overtraining

Tournament Branch Predictor (Alpha 21264)
- Local history table (1,024 x 10b) indexes a local prediction table (1,024 x 3b)
- Global history (12b) indexes a global prediction table (4,096 x 2b) and a choice prediction table (4,096 x 2b)
- The choice predictor learns whether local or global branch history is the better predictor for each branch
- Global history is speculatively updated but restored on mispredict
- Claimed 90-100% prediction success on a range of applications
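
The structure above can be sketched in software. This is a scaled-behavior model, not the 21264's exact logic: the table sizes and counter widths follow the slide, but the indexing and update details here are simplified assumptions.

```python
# Sketch of a 21264-style tournament predictor. Table sizes (1024x10b local
# history, 1024x3b local counters, 4096x2b global and choice counters, 12b
# global history) follow the slide; indexing and training are assumptions.

class TournamentPredictor:
    def __init__(self):
        self.local_history = [0] * 1024      # 10b of per-branch history
        self.local_pred = [3] * 1024         # 3-bit counters (0..7)
        self.global_pred = [1] * 4096        # 2-bit counters (0..3)
        self.choice_pred = [1] * 4096        # 2-bit: >=2 means "use global"
        self.global_history = 0              # 12b of path history

    def predict(self, pc):
        lh = self.local_history[pc % 1024]
        local_taken = self.local_pred[lh] >= 4
        global_taken = self.global_pred[self.global_history] >= 2
        use_global = self.choice_pred[self.global_history] >= 2
        return global_taken if use_global else local_taken

    def update(self, pc, taken):
        lh_idx = pc % 1024
        lh, gh = self.local_history[lh_idx], self.global_history
        local_taken = self.local_pred[lh] >= 4
        global_taken = self.global_pred[gh] >= 2
        # Train the choice predictor toward whichever component was right
        if local_taken != global_taken:
            d = 1 if global_taken == taken else -1
            self.choice_pred[gh] = min(3, max(0, self.choice_pred[gh] + d))
        # Train both component counters on the actual outcome
        d = 1 if taken else -1
        self.local_pred[lh] = min(7, max(0, self.local_pred[lh] + d))
        self.global_pred[gh] = min(3, max(0, self.global_pred[gh] + d))
        # Shift the outcome into the 10b local and 12b global histories
        self.local_history[lh_idx] = ((lh << 1) | taken) & 0x3FF
        self.global_history = ((gh << 1) | taken) & 0xFFF
```

After a few repetitions of the same outcome for a branch, both component predictors and the choice table converge, which is the behavior the slide's 90-100% claim rests on.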

Taken Branch Limit
- Integer codes have a taken branch every 6-9 instructions
- To avoid a fetch bottleneck, must execute multiple taken branches per cycle when increasing performance
- This implies predicting multiple branches per cycle and fetching multiple non-contiguous blocks per cycle

Branch Address Cache (Yeh, Marr, Patt)
- Extends the BTB to return multiple branch predictions per cycle
- Entry: PC tag | valid | predicted target #1 | len #1 | predicted target #2
[Diagram: k bits of the PC index the cache; on a tag match, a valid entry supplies target #1, len #1, and target #2]

Fetching Multiple Basic Blocks
Requires either:
- a multiported cache: expensive
- interleaving: bank conflicts will occur
Merging multiple blocks to feed to the decoders adds latency, increasing the mispredict penalty and reducing branch throughput

Trace Cache
Key idea: pack multiple non-contiguous basic blocks into one contiguous trace cache line
[Diagram: several basic blocks, each ending in a branch (BR), packed into a single trace line]
- A single fetch brings in multiple basic blocks
- The trace cache is indexed by start address and the next n branch predictions
- Used in the Intel Pentium 4 processor to hold decoded uops

Superscalar Register Renaming
- During decode, each instruction is allocated a new physical destination register
- Source operands are renamed to the physical register holding the newest value
- Execution units only see physical register numbers
[Diagram: Inst 1 and Inst 2 (Op, Dest, Src1, Src2) present read addresses to the Rename Table, update mappings through write ports, and draw new physical registers from the Register Free List, producing (Op, PDest, PSrc1, PSrc2)]
- Issue multiple instructions per cycle
- Does this work?

Superscalar Register Renaming (corrected)
- Must check for RAW hazards between instructions issuing in the same cycle: if Inst 2 reads a register that Inst 1 writes, Inst 2's source must take Inst 1's newly allocated PDest rather than the stale rename table entry
- This check can be done in parallel with the rename table lookup
[Diagram: comparators (=?) match Inst 2's Src1/Src2 against Inst 1's Dest and mux in Inst 1's PDest]
- The MIPS R10K renames 4 serially-RAW-dependent instructions per cycle
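
The intra-group RAW check above can be sketched as follows. The function and variable names (rename_group, rename_table, free_list) are illustrative, not from any real design; the point is the bypass of same-cycle destination mappings into later sources.

```python
# Sketch of two-wide (or wider) superscalar renaming with the intra-group
# RAW check: a later instruction in the group that reads a register written
# by an earlier one must take the newly allocated physical register, not
# the stale rename-table entry.

def rename_group(insts, rename_table, free_list):
    """insts: list of (dest, src1, src2) architectural register numbers,
    in program order within one rename group."""
    out = []
    new_dests = {}                   # arch reg -> PDest allocated this cycle
    for dest, src1, src2 in insts:
        # Intra-group bypass: prefer a mapping created earlier this cycle
        psrc1 = new_dests.get(src1, rename_table[src1])
        psrc2 = new_dests.get(src2, rename_table[src2])
        pdest = free_list.pop(0)     # allocate a fresh physical register
        new_dests[dest] = pdest      # later insts in the group see this
        out.append((pdest, psrc1, psrc2))
    # Commit the group's mappings to the rename table
    for arch, phys in new_dests.items():
        rename_table[arch] = phys
    return out
```

In hardware the same effect comes from the comparators and muxes in the diagram rather than a sequential loop, but the dataflow is identical.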

Speculative Loads / Stores
- Just like register updates, stores should not modify memory until after the instruction is committed
- A speculative store buffer is a structure introduced to hold speculative store data

Speculative Store Buffer
[Diagram: a buffer of entries (V, S, Tag, Data) sitting in front of the L1 Data Cache; the store-address and store-data uops fill each entry separately, and committed entries drain to the cache over the store commit path]
- Just like register updates, stores should not modify memory until after the instruction is committed
- During decode, a store buffer slot is allocated in program order
- Stores are split into store-address and store-data micro-operations
- Store-address execution writes the tag; store-data execution writes the data
- A store commits when it is the oldest instruction and both address and data are available: clear the speculative bit and eventually move the data to the cache
- On store abort: clear the valid bit
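
The lifecycle of a store buffer entry can be sketched as below. The class and method names are hypothetical; the V and S bits and the split store-address/store-data uops follow the slide.

```python
# Minimal sketch of a speculative store buffer: entries allocated in
# program order at decode, filled by separate store-address and store-data
# uops, drained to the cache only at commit, and invalidated on abort.

class SpecStoreBuffer:
    def __init__(self):
        self.entries = []   # program order: dicts with v, s, tag, data

    def allocate(self):
        # Slot allocated in program order during decode
        e = {"v": False, "s": True, "tag": None, "data": None}
        self.entries.append(e)
        return e

    def store_address(self, e, addr):
        e["tag"] = addr            # store-address uop writes the tag
        e["v"] = True

    def store_data(self, e, value):
        e["data"] = value          # store-data uop writes the data

    def commit_oldest(self, cache):
        # Oldest store commits once both address and data are available:
        # clear the speculative bit and move the data to the cache
        e = self.entries[0]
        assert e["v"] and e["data"] is not None
        e["s"] = False
        cache[e["tag"]] = e["data"]
        self.entries.pop(0)

    def abort(self, e):
        e["v"] = False             # on store abort: clear the valid bit
```

Note that the cache dictionary stays untouched until commit_oldest runs, which is exactly the invariant the slide states: speculative stores never reach memory.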

Load Bypass from Speculative Store Buffer
[Diagram: the load address probes the speculative store buffer tags and the L1 Data Cache tags in parallel; a store buffer hit overrides the cache and supplies the load data]
- If the data is in both the store buffer and the cache, which should we use? The speculative store buffer
- If the same address is in the store buffer twice, which should we use? The youngest store older than the load
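
Both answers on this slide reduce to one search rule, sketched below. The list-of-tuples representation of the store buffer is an illustrative assumption.

```python
# Sketch of load bypass: search the speculative store buffer for the
# youngest store older than the load; only on a miss fall back to the
# data cache. `stores` is a program-ordered list of (valid, addr, data)
# for stores older than the load.

def load_bypass(load_addr, stores, cache):
    # Scan youngest-first: the first valid address match is the youngest
    # older store, and it overrides any value in the data cache
    for valid, addr, data in reversed(stores):
        if valid and addr == load_addr:
            return data
    return cache.get(load_addr)
```

The youngest-first scan implements both slide answers at once: a buffer hit beats the cache, and among duplicate addresses the latest one in program order wins.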

Memory Dependencies
sd x1, (x2)
ld x3, (x4)
When can we execute the load?

In-Order Memory Queue
- Execute all loads and stores in program order
- => A load or store cannot leave the ROB for execution until all previous loads and stores have completed execution
- Loads and stores can still execute speculatively, and out-of-order with respect to other instructions
- Need a structure to handle memory ordering

Conservative O-o-O Load Execution
sd x1, (x2)
ld x3, (x4)
- Can execute the load before the store if both addresses are known and x4 != x2
- Each load address is compared with the addresses of all previous uncommitted stores (can use a partial conservative check, e.g., the bottom 12 bits of the address, to save hardware)
- Don't execute the load if any previous store address is not known (MIPS R10K: 16-entry address queue)
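
The conservative check can be sketched as a simple predicate. The function name is illustrative; the 12-bit partial compare follows the slide, and it is safe precisely because it can only err toward false matches (a lost issue opportunity), never toward missing a true conflict.

```python
# Sketch of the conservative load-issue check: a load may issue only if
# every older uncommitted store has a known address and none of them
# (partially) matches the load address. A None entry models a store whose
# address has not executed yet.

def load_may_issue(load_addr, older_store_addrs):
    for addr in older_store_addrs:
        if addr is None:
            return False    # unknown older store address: must wait
        if (addr & 0xFFF) == (load_addr & 0xFFF):
            return False    # partial 12-bit match: conservatively conflict
    return True
```
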

Address Speculation
sd x1, (x2)
ld x3, (x4)
- Guess that x4 != x2 and execute the load before the store address is known
- Need to hold all completed but uncommitted load/store addresses in program order
- If we subsequently find x4 == x2, squash the load and all following instructions
- => Large penalty for inaccurate address speculation

Memory Dependence Prediction (Alpha 21264)
sd x1, (x2)
ld x3, (x4)
- Guess that x4 != x2 and execute the load before the store
- If we later find x4 == x2, squash the load and all following instructions, but mark the load instruction as store-wait
- Subsequent executions of the same load instruction will wait for all previous stores to complete
- Periodically clear the store-wait bits
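
The store-wait mechanism can be sketched as a small predictor table. This is a behavioral sketch, not the 21264's actual structure: the table size, PC indexing, and clearing interval are illustrative assumptions.

```python
# Sketch of 21264-style memory dependence prediction: a load runs ahead of
# unknown-address stores unless its store-wait bit is set; a misordering
# squash sets the bit, and the table is periodically cleared so stale
# predictions decay.

class StoreWaitPredictor:
    def __init__(self, size=1024, clear_interval=16384):
        self.bits = [False] * size
        self.clear_interval = clear_interval
        self.lookups = 0

    def should_wait(self, load_pc):
        # Periodically clear all store-wait bits
        self.lookups += 1
        if self.lookups % self.clear_interval == 0:
            self.bits = [False] * len(self.bits)
        return self.bits[load_pc % len(self.bits)]

    def on_order_violation(self, load_pc):
        # The load was squashed for bypassing a conflicting older store:
        # future instances of this load wait for all previous stores
        self.bits[load_pc % len(self.bits)] = True
```

The periodic clear trades a few extra squashes for the ability to re-speculate on loads whose conflicting store has disappeared (e.g., after a phase change in the program).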

Datapath: Branch Prediction and Speculative Execution
[Diagram: the out-of-order pipeline from the Reminder slide, with Branch Prediction redirecting PC/Fetch, and Branch Resolution from the Branch Unit sending kill signals to Fetch, Decode & Rename, the Reorder Buffer, and Execute]

Instruction Flow in Unified Physical Register File Pipeline
Fetch:
- Get instruction bits from the current guess at the PC, place in the fetch buffer
- Update the PC using the sequential address or branch predictor (BTB)
Decode/Rename:
- Take the instruction from the fetch buffer
- Allocate resources to execute the instruction:
  - a destination physical register, if the instruction writes a register
  - an entry in the reorder buffer, to provide in-order commit
  - an entry in the issue window, to wait for execution
  - an entry in the memory buffer, if a load or store
- Decode will stall if resources are not available
- Rename source and destination registers
- Check source registers for readiness
- Insert the instruction into the issue window + reorder buffer + memory buffer

Memory Instructions
- Split each store instruction into two pieces during decode: address calculation (store-address) and data movement (store-data)
- Allocate space in the memory buffers in program order during decode
Store instructions:
- Store-address calculates the address and places it in the store buffer
- Store-data copies the store value into the store buffer
- Store-address and store-data execute independently out of the issue window
- Stores only commit to the data cache at the commit point
Load instructions:
- Load address calculation executes from the window
- A load with a completed effective address searches the memory buffer
- A load instruction may have to wait in the memory buffer for earlier store ops to resolve

Issue Stage
- Writebacks from the completion phase wake up some instructions by causing their source operands to become ready in the issue window
- In more speculative machines, writebacks might also wake up waiting loads in the memory buffer
- Need to select some instructions for issue: an arbiter picks a subset of ready instructions for execution
- Example selection policies: random, lower-first, oldest-first, critical-first
- Selected instructions are read out of the issue window and sent to execution
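
The wakeup and select steps can be sketched as below, using oldest-first as the example policy. The entry layout (a set of not-yet-ready source tags plus a ROB id) is an illustrative assumption.

```python
# Sketch of issue-stage wakeup and oldest-first select. Each window entry
# tracks the physical-register tags it is still waiting on; writeback tags
# clear matching sources, and the arbiter picks up to `width` fully ready
# entries, oldest first.

def wakeup(window, writeback_tags):
    # A completing instruction broadcasts its destination tag; any source
    # waiting on that tag becomes ready
    for e in window:
        e["waiting_on"] -= set(writeback_tags)

def select_oldest_first(window, width=2):
    # The window is kept in program order, so a front-to-back scan of
    # ready entries implements oldest-first
    picked = []
    for e in window:
        if not e["waiting_on"]:          # all source operands ready
            picked.append(e["rob_id"])
            if len(picked) == width:
                break
    return picked
```

Oldest-first is a common default because the oldest instructions are the ones most likely to be blocking commit; critical-first refines this by prioritizing instructions on the predicted critical path.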

Execute Stage
- Read operands from the physical register file and/or the bypass network from other functional units
- Execute on a functional unit
- Write the result value to the physical register file (or to the store buffer, if a store)
- Produce exception status, write it to the reorder buffer
- Free the slot in the instruction window

Commit Stage
- Read completed instructions in order from the reorder buffer (may need to wait for the next-oldest instruction to complete)
- If an exception was raised: flush the pipeline, jump to the exception handler
- Otherwise, release resources:
  - free the physical register used by the last writer to the same architectural register
  - free the reorder buffer slot
  - free the memory reorder buffer slot
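
The subtle point here, freeing the *previous* writer's physical register, can be sketched as follows. The field names mirror the ROB entry fields from the earlier instruction-window slide (Done?, LPRd, Except?); the dict-based structure and the bare-bones exception handling are illustrative assumptions.

```python
# Sketch of commit in a unified physical register file machine. Committing
# an instruction that writes arch register Rd frees LPRd, the physical
# register of the PREVIOUS writer of Rd: once the new mapping is
# architectural, no instruction can still need the old value.

def commit_one(rob, free_list):
    """rob: program-ordered list of entries
    {done, except_, rd, lprd}; returns the committed entry,
    "exception" on a flush, or None if commit must wait."""
    if not rob or not rob[0]["done"]:
        return None                      # wait for the oldest to complete
    entry = rob.pop(0)                   # frees the reorder buffer slot
    if entry["except_"]:
        rob.clear()                      # flush pipeline (handler not modeled)
        return "exception"
    if entry["lprd"] is not None:        # instruction wrote a register:
        free_list.append(entry["lprd"])  # free the previous mapping
    return entry
```
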

Question of the Day (revisited)
How many in-order cores do you think take up the same area as an out-of-order core?

How Many In-Order In The Same Area?

Performance?

Acknowledgements
These slides contain material developed and copyrighted by:
- Arvind (MIT)
- Krste Asanovic (MIT/UCB)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)
MIT material derived from course 6.823; UCB material derived from course CS252