Hardware-Based Speculation


Hardware-Based Speculation
Execute instructions along predicted execution paths, but commit the results only if the prediction was correct.
Instruction commit: allowing an instruction to update the register file when the instruction is no longer speculative.
Need an additional piece of hardware to prevent any irrevocable action until an instruction commits.
Need to separate executing the instruction (to pass data to other instructions) from completing (performing operations that cannot be undone).

Reorder Buffer
The reorder buffer (ROB) holds the result of an instruction between completion and commit, and supplies it to any instruction that needs it, just as the reservation stations do in Tomasulo's algorithm.
Four fields:
Instruction type: branch/store/register
Destination field: register number or memory address
Value field: output value
Ready field: completed execution?
Modify reservation stations: the operand source is now the reorder buffer instead of the functional unit (results are tagged with the ROB entry number).
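To make the four ROB fields above concrete, the following is a minimal sketch in C of one ROB entry and the buffer itself; the type names, field widths, and buffer size are illustrative assumptions, not taken from the slides.

    /* Minimal sketch of a reorder-buffer entry and the circular buffer.
       Field widths, names, and ROB_SIZE are assumptions for illustration. */
    #include <stdint.h>
    #include <stdbool.h>

    typedef enum { TYPE_REGISTER, TYPE_STORE, TYPE_BRANCH } InstrType;

    typedef struct {
        InstrType type;   /* instruction type: branch/store/register        */
        uint32_t  dest;   /* destination: register number or memory address */
        uint64_t  value;  /* value: result to write back at commit          */
        bool      ready;  /* ready: has the instruction finished execution? */
        bool      busy;   /* entry is currently allocated to an instruction */
    } ROBEntry;

    #define ROB_SIZE 32

    typedef struct {
        ROBEntry entries[ROB_SIZE];
        int head;         /* oldest instruction: the next one to commit     */
        int tail;         /* next free entry: allocated at issue            */
    } ReorderBuffer;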

Reorder Buffer
Register values and memory values are not written until an instruction commits.
On a misprediction, the speculated entries in the ROB are cleared.
Exceptions are not recognized until the instruction is ready to commit.
Four stages: Issue, Execute, Write Result, Commit.

Issue
If there is an empty reservation station and a free ROB entry, issue; otherwise stall.
Send operands to the reservation station if they are available in the registers or in the ROB.
The number of the ROB entry allocated to the instruction is sent to the reservation station, so that results can be tagged with it.
If an operand is not available yet, the ROB entry number of the producing instruction is sent to the reservation station, which waits for the result on the CDB.
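A minimal sketch of the Issue stage in C, assuming the ROB types sketched above; the register-status table, the single source operand, and the rs_free flag are simplifying assumptions.

    /* Sketch of the Issue stage for one source operand (builds on the
       ReorderBuffer/ROBEntry sketch above). */
    typedef struct {
        bool     busy;     /* a pending instruction will write this register */
        int      rob_tag;  /* ROB entry that will produce the value          */
        uint64_t value;    /* current committed value of the register        */
    } RegStatus;

    bool issue(ReorderBuffer *rob, RegStatus regs[], bool rs_free,
               int src_reg, int *wait_tag, uint64_t *op_value)
    {
        if (!rs_free || (rob->tail + 1) % ROB_SIZE == rob->head)
            return false;                        /* no free RS or ROB entry: stall */

        int entry = rob->tail;                   /* allocate this ROB entry        */
        rob->entries[entry].busy  = true;
        rob->entries[entry].ready = false;

        *wait_tag = -1;                          /* -1 means the operand is ready  */
        if (!regs[src_reg].busy)
            *op_value = regs[src_reg].value;     /* operand available in registers */
        else if (rob->entries[regs[src_reg].rob_tag].ready)
            *op_value = rob->entries[regs[src_reg].rob_tag].value;  /* in the ROB  */
        else
            *wait_tag = regs[src_reg].rob_tag;   /* wait for this tag on the CDB   */

        rob->tail = (rob->tail + 1) % ROB_SIZE;
        return true;
    }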

Execute
If one or more operands are not available, monitor the CDB; when the result is broadcast on the CDB (identified by its ROB entry tag), copy it.
When all operands are ready, start execution.

Write Result
When execution completes, broadcast the result on the CDB, tagged with the ROB entry number.
The result is copied into the ROB entry and into all waiting reservation stations.

Commit
When an instruction reaches the head of the ROB and its result is ready in the buffer, write it to the register file and remove the instruction from the ROB.
If the instruction is a store, write the value to memory and remove the instruction from the ROB.
If the instruction is a branch and the prediction was correct, remove it from the ROB; on a misprediction, flush the ROB and restart from the correct successor.
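A corresponding sketch of the Commit stage, using the same assumed types; the reg_file and memory arrays stand in for the real register file and memory datapath.

    /* Sketch of the Commit stage: only the head of the ROB may update
       architectural state (builds on the sketches above). */
    void commit(ReorderBuffer *rob, RegStatus regs[], uint64_t reg_file[],
                uint64_t memory[], bool prediction_correct)
    {
        ROBEntry *e = &rob->entries[rob->head];
        if (!e->busy || !e->ready)
            return;                                    /* head not finished: wait     */

        switch (e->type) {
        case TYPE_REGISTER:
            reg_file[e->dest] = e->value;              /* update the register file    */
            if (regs[e->dest].busy && regs[e->dest].rob_tag == rob->head)
                regs[e->dest].busy = false;            /* no younger writer in flight */
            break;
        case TYPE_STORE:
            memory[e->dest] = e->value;                /* memory is written only here */
            break;
        case TYPE_BRANCH:
            if (!prediction_correct) {
                rob->tail = (rob->head + 1) % ROB_SIZE; /* flush everything younger   */
                /* ...redirect fetch to the correct successor... */
            }
            break;
        }
        e->busy = false;
        rob->head = (rob->head + 1) % ROB_SIZE;        /* retire the head entry       */
    }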

Multiple Issue and Static Scheduling
To achieve CPI < 1, the processor needs to complete multiple instructions per clock.
Solutions:
Statically scheduled superscalar processors
VLIW (very long instruction word) processors
Dynamically scheduled superscalar processors

VLIW Processors
Package multiple operations into one instruction.
Example VLIW processor:
One integer instruction (or branch)
Two independent floating-point operations
Two independent memory references
There must be enough parallelism in the code to fill the available slots.

Disadvantages:
Statically finding parallelism
Code size
No hazard detection hardware
Binary code compatibility
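As a sketch of what "packaging multiple operations into one instruction" means for the example machine above, the following C struct lays out a five-slot VLIW word; the slot layout and the Operation encoding are assumptions for illustration.

    /* Sketch of a five-slot VLIW instruction word for the example machine
       (two memory, two FP, one integer/branch).  Encoding is illustrative. */
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint8_t opcode;
        uint8_t rd, rs1, rs2;
        int16_t imm;
        bool    nop;              /* slot left empty when no independent op fits */
    } Operation;

    typedef struct {
        Operation mem[2];         /* two independent memory references           */
        Operation fp[2];          /* two independent floating-point operations   */
        Operation int_or_branch;  /* one integer operation or branch             */
    } VLIWWord;

Slots the compiler cannot fill with independent operations are left as nops, which is one source of the code-size disadvantage listed above.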

VLIW Example

Latencies:

    Source instruction   Instruction using result   Latency
    FP ALU op            FP ALU op                  3
    FP ALU op            Store double               2
    Load double          FP ALU op                  1
    Load double          Store double               0

Source loop, for (i = 1000; i > 0; i--) x[i] = x[i] + s; :

    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    0(R1),F4
          DADDUI R1,R1,#-8
          BNE    R1,R2,Loop

Assume that we can schedule two memory operations, two FP operations, and one integer operation or branch per cycle. Unrolling the loop to cover seven iterations gives the following schedule:

    Memory reference 1   Memory reference 2   FP operation 1     FP operation 2     Int. op/branch      Clock
    LD  F0,0(R1)         LD  F6,-8(R1)                                                                  1
    LD  F10,-16(R1)      LD  F14,-24(R1)                                                                2
    LD  F18,-32(R1)      LD  F22,-40(R1)      ADDD F4,F0,F2      ADDD F8,F6,F2                          3
    LD  F26,-48(R1)                           ADDD F12,F10,F2    ADDD F16,F14,F2                        4
                                              ADDD F20,F18,F2    ADDD F24,F22,F2                        5
    SD  0(R1),F4         SD  -8(R1),F8        ADDD F28,F26,F2                                           6
    SD  -16(R1),F12      SD  -24(R1),F16                                            DADDUI R1,R1,#-56   7
    SD  24(R1),F20       SD  16(R1),F24                                                                 8
    SD  8(R1),F28                                                                   BNEZ R1,LOOP        9

Dynamic Scheduling, Multiple Issue, and Speculation
Modern microarchitectures combine dynamic scheduling, multiple issue, and speculation.
Two approaches:
Assign reservation stations and update the pipeline control table in half clock cycles (only supports 2 instructions/clock)
Design logic to handle any possible dependences between the instructions
Hybrid approaches are also possible.
The issue logic can become the bottleneck.

Overview of Design (figure not reproduced in this transcription)

Multiple Issue
Limit the number of instructions of a given class that can be issued in a bundle, e.g. one FP, one integer, one load, one store.
Examine all the dependences among the instructions in the bundle.
If dependences exist within the bundle, encode them in the reservation stations.
Also need multiple completion/commit.

Example

    Loop: LD     R2,0(R1)    ;R2 = array element
          DADDIU R2,R2,#1    ;increment R2
          SD     R2,0(R1)    ;store result
          DADDIU R1,R1,#8    ;increment pointer
          BNE    R2,R3,LOOP  ;branch if not last element
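As a sketch of "examine all the dependences among the instructions in the bundle", the following C routine finds intra-bundle RAW dependences; the DecodedInstr fields and the bundle size are illustrative assumptions.

    /* Sketch of intra-bundle dependence checking at issue time.  The issue
       logic would use the result to tag reservation-station operands with
       the producing instruction's ROB entry. */
    #define BUNDLE_SIZE 4

    typedef struct {
        int dest_reg;             /* -1 if the instruction writes no register */
        int src_reg[2];           /* -1 if a source slot is unused            */
    } DecodedInstr;

    void find_bundle_deps(const DecodedInstr bundle[BUNDLE_SIZE],
                          int dep_on[BUNDLE_SIZE][2])
    {
        for (int i = 0; i < BUNDLE_SIZE; i++)
            for (int s = 0; s < 2; s++) {
                dep_on[i][s] = -1;                 /* default: no dependence  */
                for (int j = 0; j < i; j++)        /* only earlier instrs     */
                    if (bundle[i].src_reg[s] != -1 &&
                        bundle[i].src_reg[s] == bundle[j].dest_reg)
                        dep_on[i][s] = j;          /* latest earlier writer   */
            }
    }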

Example (No Speculation): timing table not reproduced in this transcription.
Example: timing table not reproduced in this transcription.

Branch-Target Buffer
Need high instruction bandwidth!
Branch-target buffer: a next-PC prediction buffer, indexed by the current PC.

Branch Folding
Optimization: use a larger branch-target buffer.
Add the target instruction itself into the buffer, to deal with the longer decoding time required by the larger buffer; this is called branch folding.
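A minimal sketch of a direct-mapped branch-target buffer lookup in C; the table size, tag check, and +4 fall-through are assumptions for illustration.

    /* Sketch of a direct-mapped branch-target buffer (BTB) lookup. */
    #include <stdint.h>
    #include <stdbool.h>

    #define BTB_ENTRIES 1024

    typedef struct {
        bool     valid;
        uint64_t branch_pc;     /* tag: PC of the branch instruction        */
        uint64_t target_pc;     /* predicted next PC if the branch is taken */
    } BTBEntry;

    static BTBEntry btb[BTB_ENTRIES];

    /* Returns the predicted next PC for the current fetch PC. */
    uint64_t predict_next_pc(uint64_t pc)
    {
        BTBEntry *e = &btb[(pc >> 2) % BTB_ENTRIES];   /* index by current PC */
        if (e->valid && e->branch_pc == pc)
            return e->target_pc;                       /* predicted taken     */
        return pc + 4;                                 /* fall through        */
    }

With branch folding, an entry would also hold the target instruction itself so that the fetch of the target can be skipped; that extension is not shown here.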

Return Address Predictor
Most unconditional branches come from function returns.
The same procedure can be called from multiple sites, which can cause the branch-target buffer to forget the return address from previous calls.
Solution: create a return-address buffer organized as a stack.

Integrated Instruction Fetch Unit
Design a monolithic unit that performs:
Branch prediction
Instruction prefetch (fetch ahead)
Instruction memory access and buffering (deal with fetches that cross cache lines)
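A minimal sketch of such a return-address stack; the depth and the wrap-around overflow policy are assumptions.

    /* Sketch of a return-address stack (RAS) predictor. */
    #include <stdint.h>

    #define RAS_DEPTH 16

    static uint64_t ras[RAS_DEPTH];
    static int      ras_top = 0;

    /* On a call: push the address of the instruction after the call. */
    void ras_push(uint64_t return_addr)
    {
        ras[ras_top % RAS_DEPTH] = return_addr;   /* on overflow, oldest entry is lost */
        ras_top++;
    }

    /* On a return: pop the predicted return address. */
    uint64_t ras_pop(void)
    {
        if (ras_top > 0)
            ras_top--;
        return ras[ras_top % RAS_DEPTH];
    }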

Register Renaming
Register renaming vs. reorder buffers:
Instead of virtual registers supplied by the reservation stations and the reorder buffer, create a single register pool that contains both the architecturally visible registers and the virtual registers.
Use a hardware-based map to rename registers during issue.
WAW and WAR hazards are avoided.
Speculation recovery occurs by copying during commit.
Still need a ROB-like queue to update the map table in order.
Simplifies commit:
Record that the mapping between the architectural register and the physical register is no longer speculative.
Free up the physical register that held the older value.
In other words, swap physical registers on commit.
Physical register de-allocation is more difficult than with a reorder buffer.

Integrated Issue and Renaming
Combining instruction issue with register renaming:
The issue logic pre-reserves enough physical registers for the bundle (possibly a fixed number).
The issue logic finds dependences within the bundle and maps registers as necessary.
The issue logic finds dependences between the current bundle and the bundles already in flight and maps registers as necessary.
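A minimal sketch of the rename step described above; the table sizes, the free-list handling, and the two-source instruction format are illustrative assumptions.

    /* Sketch of hardware register renaming at issue time.  rename_map and
       free_list must be initialized so every architectural register maps to
       a distinct physical register and the rest are on the free list. */
    #include <stdbool.h>

    #define ARCH_REGS 32
    #define PHYS_REGS 128

    static int rename_map[ARCH_REGS];   /* architectural -> physical register */
    static int free_list[PHYS_REGS];
    static int free_count;

    /* Sources read the current map; the destination gets a fresh physical
       register, which removes WAW and WAR hazards.  The old mapping is
       returned so it can be freed when this instruction commits. */
    bool rename_instr(int src1, int src2, int dest,
                      int *psrc1, int *psrc2, int *pdest, int *old_pdest)
    {
        if (free_count == 0)
            return false;                    /* no free physical register: stall */
        *psrc1 = rename_map[src1];
        *psrc2 = rename_map[src2];
        *old_pdest = rename_map[dest];       /* freed at commit                  */
        *pdest = free_list[--free_count];    /* allocate a new physical register */
        rename_map[dest] = *pdest;
        return true;
    }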

How Much to Speculate?
Mis-speculation degrades performance and power relative to no speculation.
Speculation may cause additional misses (cache, TLB).
Prevent speculative code from causing higher-cost misses (e.g. in L2).
Speculating through multiple branches complicates speculation recovery; no processor yet resolves multiple branches per cycle.

Energy Efficiency
Speculation is only energy efficient when it significantly improves performance.

Value Prediction
Uses:
Loads that load from a constant pool
Instructions that produce a value from a small set of values
Value prediction has not been incorporated into modern processors; a similar idea, address aliasing prediction, is used in some processors.
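To make the value-prediction idea concrete, here is a sketch of a simple last-value predictor with a confidence counter; the table size and thresholds are assumptions, and, as noted above, this technique is not used in modern processors.

    /* Sketch of a last-value predictor indexed by instruction PC. */
    #include <stdint.h>
    #include <stdbool.h>

    #define VP_ENTRIES 256

    typedef struct {
        uint64_t last_value;   /* value produced the last time this PC executed */
        uint8_t  confidence;   /* saturating counter: predict only if confident */
    } VPEntry;

    static VPEntry vp[VP_ENTRIES];

    bool predict_value(uint64_t pc, uint64_t *value)
    {
        VPEntry *e = &vp[(pc >> 2) % VP_ENTRIES];
        if (e->confidence >= 2) { *value = e->last_value; return true; }
        return false;
    }

    void train_value(uint64_t pc, uint64_t actual)
    {
        VPEntry *e = &vp[(pc >> 2) % VP_ENTRIES];
        if (e->last_value == actual) { if (e->confidence < 3) e->confidence++; }
        else { e->confidence = 0; e->last_value = actual; }
    }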