Chapter 3 (CONT II). Instructor: Josep Torrellas, CS433. Copyright J. Torrellas 1999, 2001, 2002, 2007, 2013.

Hardware-Based Speculation (Section 3.6)

In multiple-issue processors, stalls due to branches would be frequent: you may need to execute one branch per cycle, yet execution cannot start until the branch is resolved.
Solution: speculate on the outcome of the branch and execute as if the guess were correct.
Note the extension over ordinary branch prediction: we now allow speculative instructions to execute. This is called hardware speculation, and it requires the ability to recover from mispredictions.
Hardware-based speculation combines:
- dynamic branch prediction
- speculation, to allow execution past unresolved branches
- aggressive dynamic scheduling across basic blocks (because we now aggressively execute past basic-block boundaries)

Speculative Execution Based on Tomasulo

Used in most processors: IBM, Intel, AMD.
Key: separate the bypassing of results among instructions from full instruction completion:
- allow a speculative instruction to pass its result to other instructions;
- do not complete the instruction (do not let it update any state that cannot be undone).
When the instruction is no longer speculative, allow it to modify the register file or memory: the Instruction Commit step.
Overall: instructions execute out of order, but they commit in order. No irrevocable action is taken until an instruction commits.

Reorder Buffer (ROB)

A hardware buffer that holds the results of instructions that have completed but not yet committed.
Also used to pass results among instructions.
Provides additional registers, like the reservation stations in Tomasulo: it extends the register set.
Difference between the ROB and Tomasulo's reservation stations:
- Tomasulo: once the FU writes its result, subsequent instructions find the value in the register set.
- ROB: the ROB supplies the results of completed instructions that have not yet committed.

ROB

Fields in an entry:
- Instruction type: BR (no destination), SD (destination is memory), Other (destination is a register)
- Destination: destination register or destination memory address
- Value: holds the result until commit
- Ready: has the instruction finished (so that its result is ready)?
The ROB takes the place of the store buffers.
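The four fields above can be sketched as a small data structure. This is a minimal illustration, not a real hardware model; the field names are chosen here for readability.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ROBEntry:
    """One reorder-buffer entry, mirroring the four fields above."""
    itype: str                  # "BR", "SD", or "Other"
    dest: Optional[str] = None  # destination register name, or memory address for SD
    value: object = None        # result, held here until commit
    ready: bool = False         # True once the instruction has finished

# The ROB itself is a FIFO of such entries, head = oldest instruction:
rob = [ROBEntry("Other", dest="F0"), ROBEntry("BR")]
rob[0].value, rob[0].ready = 3.5, True   # the MUL-like entry finishes
```

Only when an entry reaches the head with `ready` set may its `value` update architectural state.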

Figure 3.11

Functions of the Reservation Stations and the ROB

Reservation stations: a place to buffer operations and operands from the time an instruction issues until it can execute. Each RS tracks the ROB entry assigned to its instruction.
ROB: performs the renaming. Every instruction holds a position in the ROB from when it issues until it commits. A result is tagged with its ROB entry number rather than with an RS number.
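Renaming via ROB entry numbers can be sketched with a register-status table mapping each register to the ROB entry that will produce its newest value. The names here are illustrative, not from the text.

```python
# Register status: which ROB entry (if any) will produce each register's
# newest value. Renaming = tagging in-flight results with ROB entry numbers.
reg_status = {}                    # e.g. {"F0": 3}: ROB entry 3 will write F0
regfile = {"F2": 1.0, "F4": 2.0}   # committed architectural values

def rename_source(reg):
    """Return ('rob', tag) if the value is still in flight, else ('val', v)."""
    if reg in reg_status:
        return ("rob", reg_status[reg])   # RS will wait for this ROB entry's result
    return ("val", regfile[reg])          # value is already architectural

reg_status["F0"] = 3   # say a MUL.D destined for F0 occupies ROB entry 3
```

A reservation station reading F0 would thus record tag 3 and wait for that tag on the CDB, while a read of F2 gets the value immediately.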

Overall Instruction Execution

ISSUE (or DISPATCH):
- Get an instruction from the instruction queue.
- Issue it if there is an empty RS and an empty ROB entry.
- Send operands to the RS if they are available in a register or in a ROB entry.
- Send the ROB entry number to the RS, so that it can be used to tag the result of the execution when the result is dumped on the CDB.
- If there is no free RS or no free ROB entry: stall the instruction and its successors.

EXECUTE:
- Monitor the CDB for any missing input (this enforces RAW dependences); when it appears, read it into the RS.
- When both operands are ready and the FU is free, execute. (Note: some sources call this step "issue"; we will not use that word here.)
- Execution may take several cycles.
- Loads require two steps (address generation and memory access); stores need only one step (address generation).
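The ISSUE step above can be sketched as follows; this is a simplified sketch under the assumption of a single destination register per instruction, with illustrative names throughout.

```python
ROB_SIZE = 8   # illustrative capacity

def issue(inst, rob, rs_free, reg_status):
    """ISSUE step: needs a free RS and a free ROB entry, else stall.

    Returns the ROB tag given to the instruction, or None on a stall.
    """
    if not rs_free or len(rob) >= ROB_SIZE:
        return None                  # structural stall: inst and successors wait
    tag = len(rob)                   # ROB entry number = the tag for the result
    rob.append({"inst": inst, "ready": False, "value": None})
    reg_status[inst["dest"]] = tag   # later readers of dest will wait on this tag
    return tag

rob, reg_status = [], {}
t = issue({"op": "MUL.D", "dest": "F0"}, rob, rs_free=True, reg_status=reg_status)
```

Note how the tag handed back to the RS is exactly the ROB entry number, matching the tagging rule stated above; with no free RS the same call returns None and nothing is allocated.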

Overall Instruction Execution (continued)

WRITE RESULT:
- When execution completes, dump the result on the bus together with the ROB entry number, and mark the RS as available.
- The data is sent to any RS waiting for it and to the correct ROB entry.
- We assume that for stores, the data to be stored is generated at this time, so the value to be stored is also moved into the ROB entry.

COMMIT (or GRADUATION): when an instruction reaches the head of the ROB and its result is present in the entry:
- update the register file and remove the ROB entry;
- if the instruction is a store: do the same, except that memory is updated instead;
- if the instruction is a branch with a correct prediction: proceed as usual;
- if the instruction is a branch with an incorrect prediction: flush the ROB and restart execution at the correct successor.
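The COMMIT cases above can be sketched in one routine. This is an illustrative sketch only: it models register writes, stores, and mispredicted branches, and uses hypothetical dictionary-based state.

```python
def commit(rob, regfile, memory):
    """Commit the head of the ROB, if its result is ready; in-order by design."""
    if not rob or not rob[0]["ready"]:
        return "wait"                            # head not finished: nothing commits
    head = rob.pop(0)
    if head["type"] == "SD":
        memory[head["dest"]] = head["value"]     # memory updated only at commit
    elif head["type"] == "BR":
        if head["mispredicted"]:
            rob.clear()                          # flush every younger instruction
            return "flush"                       # caller refetches correct successor
    else:
        regfile[head["dest"]] = head["value"]    # architectural register updated
    return "commit"

regfile, memory = {}, {}
rob = [{"type": "Other", "dest": "F6", "value": 9.0, "ready": True}]
r = commit(rob, regfile, memory)
```

Because only `rob[0]` is ever examined, a finished instruction deeper in the ROB cannot touch architectural state, which is precisely what makes the scheme undoable until commit.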

Example

Example on page 187 and Figure 3.12 (snapshot when MUL.D is in the Write Result stage):

L.D F6,32(R2)
L.D F2,44(R3)
MUL.D F0,F2,F4
SUB.D F8,F2,F6
DIV.D F10,F0,F6
ADD.D F6,F8,F2

No need to worry about timing.
Compare to non-speculative Tomasulo (Figure 3.7): there, instructions complete out of order; here, they must commit in order.

Continuation of the Example

Major difference: with the ROB, dynamic execution can maintain a precise exception model. Example: if the MUL.D suffers an exception:
- wait until it reaches the head of the ROB;
- service the exception;
- flush all pending instructions.

Example of Branch Misprediction

Example on page 189 and Figure 3.13. No need to worry about timing.
After a branch is mispredicted, recover by:
- clearing the ROB entries of all instructions that follow the branch;
- letting the branch and the instructions before it commit;
- fetching from the correct branch successor.

Exceptions:
- Record the exception in the ROB entry.
- If the excepting instruction is on a mispredicted path, simply squash the instruction and discard the exception.
- When an excepting instruction reaches the head of the ROB: process the exception.
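The recovery rule above, keep everything up to and including the branch, squash everything younger, can be sketched in a few lines. The function and its arguments are illustrative.

```python
def recover(rob, branch_idx, correct_pc):
    """Branch-misprediction recovery: entries older than the branch (lower
    ROB indices) are kept and allowed to commit; everything after the
    branch is cleared. Returns the PC to refetch from."""
    del rob[branch_idx + 1:]   # squash all speculative successors of the branch
    return correct_pc          # fetch restarts at the correct branch successor

rob = ["LD", "BR(mispredicted)", "ADD", "MUL"]
pc = recover(rob, branch_idx=1, correct_pc=0x40)
```

Note that the branch itself remains in the ROB so that it, and its predecessors, can still commit normally.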

Handling Stores

Tomasulo's algorithm: stores update caches/memory as soon as they complete write-back.
Speculative processor: stores update caches/memory only when they reach the head of the ROB. Therefore, memory is never updated by a speculative instruction.
Note: in the speculative processor, a store can go through the WRITE RESULT stage even if the value to be stored is not yet ready; the value is needed only to COMMIT. When the value becomes ready, it is placed in the ROB entry.

Dependences Through Memory

WAW: eliminated because memory is updated in order, when the store is at the head of the ROB; by then, no earlier load or store can still be pending.
WAR: eliminated for the same reason.
RAW: maintained with two restrictions:
- a load cannot be sent to memory if a previous store in the ROB has the same address;
- a load computes its effective address in order relative to all earlier stores.
Note: some speculative machines actually bypass the value from a previous store directly to a later load, so the load does not have to go to memory.
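The RAW-through-memory check above can be sketched as a scan of the older ROB entries. This is a conservative sketch: a load also waits on any older store whose address is not yet computed, which is one common way to honor the in-order address rule; all names are illustrative.

```python
def load_may_issue_to_memory(rob, load_addr, load_idx):
    """RAW through memory: a load may not go to memory while an earlier
    store in the ROB has the same address (or a still-unknown address)."""
    for i in range(load_idx):                  # scan only OLDER entries
        e = rob[i]
        if e["type"] == "SD" and (e["addr"] is None or e["addr"] == load_addr):
            return False                       # wait on matching/unresolved store
    return True

rob = [{"type": "SD", "addr": 100}, {"type": "LD", "addr": 100}]
blocked = load_may_issue_to_memory(rob, 100, 1)   # earlier store to same address
allowed = load_may_issue_to_memory(rob, 200, 1)   # different address: may proceed
```

A machine with store-to-load bypassing would, in the `blocked` case, forward the store's value from the ROB instead of stalling, as the note above mentions.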

Multiple Issue Processors

If all the techniques described are successful, CPI = 1. To reduce CPI below 1, issue multiple instructions per cycle:
(a) superscalar processors: issue a varying number of instructions per clock; may be statically scheduled (in-order execution) or dynamically scheduled based on Tomasulo (out-of-order execution);
(b) VLIW or EPIC processors: issue a fixed number of instructions per clock (usually packaged as one large instruction); statically scheduled.
See Figure 3.15.

Superscalars

For example, issue 0-8 instructions in a cycle. Usually the instructions must be independent and must satisfy some constraints, e.g. no more than one memory reference. If an instruction does not meet the issue criteria, do not issue it, and do not issue the instructions that follow it.
VLIW: the compiler has the responsibility of creating the package of instructions to issue simultaneously; if it cannot find enough, it inserts NOOPs.
Example superscalar: issue one integer plus one floating-point instruction per cycle.
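The in-order issue rule above, stop at the first instruction that violates a constraint, can be sketched as follows. The width and the one-memory-reference limit are example constraints taken from the text; the representation is hypothetical.

```python
def select_issue_group(window, max_issue=8, max_mem=1):
    """In-order superscalar issue: take instructions from the fetch window
    until one violates a constraint; then stop, issuing neither it nor
    anything after it this cycle."""
    group, mem_refs = [], 0
    for inst in window:
        if inst["mem"]:
            if mem_refs == max_mem:
                break              # constraint violated: stop issuing this cycle
            mem_refs += 1
        group.append(inst)
        if len(group) == max_issue:
            break                  # issue width exhausted
    return group

window = [{"op": "LD", "mem": True}, {"op": "ADD", "mem": False},
          {"op": "SD", "mem": True}, {"op": "MUL", "mem": False}]
g = select_issue_group(window)
```

Here the SD is the second memory reference, so it and the MUL behind it wait for the next cycle even though the MUL alone would have been issuable.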

Multiple Issue with Dynamic Scheduling and Speculation

We use Tomasulo. For example: issue two instructions per cycle; these instructions may even have dependences on each other.
In general, an n-issue width is implemented with a combination of two approaches:
- run the issue step at a higher speed, so that e.g. two instructions can be issued in one cycle;
- build wider issue logic, so that multiple instructions can be issued at the same time with all dependences checked.

Example: 2-issue

Loop: LD R2,0(R1)
      DADDIU R2,R2,#1
      SD R2,0(R1)
      DADDIU R1,R1,#4
      BNE R2,R3,Loop

Assume separate FUs for effective-address calculation, ALU operations, and branch-condition evaluation.
Assume two instructions of any type can commit per cycle.

Example: 2-issue (continued)

Figure 3.19: no speculation. Note: here we assume that a load/store cannot complete its effective-address calculation until the previous branch is resolved. One cycle could be saved by allowing the address calculation but not the memory access.
Figure 3.20: speculation (up to two instructions can commit per cycle). Since the branch is the key performance limitation, speculation helps!

Considerations for Speculative Machines

1. ROB vs. register renaming:
Using the ROB: the architectural registers are contained in the combination of the register set, the reservation stations, and the ROB. If we stop issuing instructions for a while, then as instructions commit, the register values appear in the register file.
Using an explicit set of physical registers plus register renaming: the physical registers hold both the architecturally visible values and the temporary values (note that some ROB entries did not need a destination register). The extended registers play the role of both the reservation stations and the ROB. WAW and WAR hazards are eliminated by renaming the destination register.

Considerations for Speculative Machines (continued)

2. How much to speculate:
Advantage of speculation: execute past branches. Disadvantage: may execute many useless instructions.
Consequently, execute under speculation only through low-cost events:
- yes: misses in the L1 cache;
- no: TLB misses; in this case, wait until the instruction causing the event becomes non-speculative.

Considerations for Speculative Machines (continued)

3. Speculating through multiple branches:
Some programs have a high frequency of branches, or branch clustering, which motivates speculating across multiple branches. This complicates the process of speculation recovery, and it is hard to speculate on more than one branch per cycle.