
Instruction-Level Parallelism and its Exploitation: PART 2 Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.8) Advanced techniques (2.9)

Lectures 1. Introduction 2. Instruction-level Parallelism, part 1 3. Instruction-level Parallelism, part 2 4. Memory Hierarchies 5. Multiprocessors and Thread-Level Parallelism 6. System Aspects and Virtualization 7. Summary and Review

Better performance in pipeline

for (i=1000; i>0; i=i-1)
    x[i] = x[i] + 10.0;

loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, #8
      BNEZ R1, loop
      NOP

Basic pipeline: many stalls. Dynamic scheduling, prediction, speculation, multiple issue, etc. reduce them.

General Processor Organization
[Block diagram: fetch instruction -> get operands & issue -> integer & logic / floating point / memory access units -> update state]
Major bottlenecks:
- Control hazards, memory performance => Fetch bottleneck
- Data hazards, structural hazards, control hazards => Issue bottleneck

Fetch Bottleneck
- Control hazards: dynamic branch prediction (predict the outcome of branches and jumps); branch target buffers
- Memory bottleneck: memory performance improvement (memory hierarchy, prefetch, ...)

Issue Bottleneck
- RAW hazards: dynamic scheduling (out-of-order execution)
- WAR & WAW hazards: remove name dependences (register renaming)
- Structural hazards: dynamic scheduling (out-of-order execution); memory performance improvement (memory hierarchy, prefetch, non-blocking caches, load/store queues); multiple and pipelined functional units
- Control hazards: speculative execution
- Single issue: issue multiple instructions per cycle (superscalar, VLIW)

Dynamic Instruction Scheduling (Ch. 2.4)
Key idea: allow subsequent independent instructions to proceed.

DIVD F0,F2,F4   ; takes a long time
ADDD F10,F0,F8  ; stalls waiting for F0 -- instr. gets stuck here
SUBD F12,F8,F13 ; let this instr. bypass the ADDD

Enables out-of-order execution => out-of-order completion (IF ID EX M WB).
Two historical schemes used in recent machines:
Scoreboard: dates back to the CDC 6600 in 1963
Tomasulo's algorithm: IBM 360/91 in 1967
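The effect of letting the independent SUBD bypass the stalled ADDD can be sketched with a toy dataflow model. This is a sketch of the idea only; the latencies and the set of initially ready registers are illustrative assumptions, not from the slides.

```python
def completion_order(instrs, latencies, initially_ready):
    """Dataflow-limited schedule: an instruction starts as soon as all
    its source values exist and finishes after its latency, so an
    independent instruction need not wait for a stalled earlier one."""
    produced = {r: 0 for r in initially_ready}  # cycle each value exists
    finish = {}
    for name, dest, srcs in instrs:
        start = max(produced[s] for s in srcs)  # RAW: wait for sources
        finish[name] = start + latencies[name]
        produced[dest] = finish[name]
    return sorted(finish, key=finish.get)

program = [
    ("DIVD", "F0",  ["F2", "F4"]),
    ("ADDD", "F10", ["F0", "F8"]),   # RAW on F0: stalls behind DIVD
    ("SUBD", "F12", ["F8", "F13"]),  # independent: bypasses the ADDD
]
lat = {"DIVD": 20, "ADDD": 2, "SUBD": 2}   # assumed latencies
order = completion_order(program, lat, {"F2", "F4", "F8", "F13"})
print(order)   # SUBD completes long before ADDD
```

In-order issue with out-of-order completion falls out of the dependences alone: SUBD finishes at cycle 2 while ADDD waits until cycle 22.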

Tomasulo's Algorithm

Basic Ideas
- Decouple issue from operand fetch: prevents stalls due to RAW hazards
- Register renaming: translate register references to instruction (functional unit) references; prevents WAR and WAW hazards

Three Stages of Tomasulo's Alg.
1. Issue: get instruction from FP Op Queue. Issue if there is no structural hazard for a reservation station.
2. Execution: operate on operands (EX). Execute when both operands are available; if not ready, watch the Common Data Bus (CDB) for the result.
3. Write result: finish execution (WB). Write on the CDB to all awaiting functional units; mark the reservation station available.
Normal bus: data + destination. Common Data Bus: data + source.
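The three stages can be sketched in a few lines of Python. The RS field names (Vj/Vk for captured values, Qj/Qk for tags of pending producers) follow the reservation-station tables later in the lecture; the station names and operand values here are illustrative assumptions.

```python
class RS:
    """One reservation station: an op plus two operands, each either
    a value (vj/vk) or a tag of the producing station (qj/qk)."""
    def __init__(self, op, vj=None, vk=None, qj=None, qk=None):
        self.op, self.vj, self.vk, self.qj, self.qk = op, vj, vk, qj, qk
    def ready(self):
        return self.qj is None and self.qk is None  # both operands present

def cdb_broadcast(stations, tag, value):
    """Write result: every waiting station snoops the CDB and captures
    the value if it was waiting on `tag` (data + source, not destination)."""
    for rs in stations.values():
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None

stations = {
    "Mult1": RS("MUL", vj=2.0, vk=3.0),       # both operands ready
    "Add1":  RS("ADD", vj=1.0, qk="Mult1"),   # second operand pending
}
assert not stations["Add1"].ready()           # must watch the CDB
cdb_broadcast(stations, "Mult1", 2.0 * 3.0)   # Mult1 finishes, broadcasts
assert stations["Add1"].ready()               # Add1 may now execute
```

Note how issue recorded a tag, not a register name: that renaming is what removes the WAR/WAW hazards mentioned above.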

Tomasulo and Dynamic Branch Prediction
Tomasulo's algorithm assumes an instruction is completed when its result is written. Dynamic branch prediction allows instructions to be speculatively issued after a branch until the branch has executed. With dynamic scheduling according to Tomasulo's algorithm, instructions following a predicted branch must not execute and write their results until the prediction is verified!

Hardware-Based Speculative Execution
Tomasulo's algorithm provides speculative issue, but not speculative execution. This may be a serious bottleneck, especially for programs with a high branch frequency. Speculative execution requires:
- Separate execution from commit
- Keep track of temporary results
- Commit instructions in program order

HW Support for Speculation
Need a reorder buffer (ROB) for uncommitted instructions:
- The reorder buffer can be an operand source; once an operation commits, the register file is updated
- Use the reorder buffer number instead of the reservation station as the result tag
- Instructions commit in order
- Flush the reorder buffer when a branch is mispredicted
- Store buffers are integrated into the ROB

Four Steps of a Speculative Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue. If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number for the destination.
2. Execution: operate on operands (EX). If both operands are ready, execute; if not, watch the CDB for the result, and execute once both operands are in the reservation station.
3. Write result: finish execution. Write on the Common Data Bus to all awaiting FUs and the reorder buffer; mark the reservation station available.
4. Commit: update register with reorder result. When the instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer.
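Step 4, in-order commit from the ROB head, is what distinguishes this from plain Tomasulo. A minimal sketch, assuming a ROB entry is just a destination register plus an optional result value:

```python
from collections import deque

def commit_ready(rob, regs):
    """Pop entries from the ROB head while their results are present,
    updating the architectural register file strictly in program order.
    Younger finished instructions must wait behind an unfinished head."""
    committed = []
    while rob and rob[0]["value"] is not None:
        e = rob.popleft()
        regs[e["dest"]] = e["value"]   # register file updated only here
        committed.append(e["dest"])
    return committed

rob = deque([
    {"dest": "F0", "value": None},    # head: result not ready yet
    {"dest": "F8", "value": 7.0},     # finished early, must still wait
])
regs = {}
assert commit_ready(rob, regs) == []  # head not ready: nothing commits
rob[0]["value"] = 3.0                 # head's result arrives on the CDB
assert commit_ready(rob, regs) == ["F0", "F8"]
```

Because the register file is touched only at commit, discarding the buffer is enough to undo any amount of speculative work.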

Reorder buffer:
Entry  Busy  Instruction        State         Dest  Value
#1     no    L.D F6, 34(R2)     Committed     F6    M[34+Reg[R2]]
#2     yes   L.D F2, 44(R3)     Write result  F2    M[44+Reg[R3]]
#3     yes   MUL.D F0, F2, F4   Execute       F0
#4     yes   SUB.D F8, F2, F6   Ready         F8    #2 - #1
#5     yes   DIV.D F10, F0, F6  Execute       F10
#6     yes   ADD.D F6, F8, F2   Ready         F6    #4 + #2

Register ROB# tags: F0: #3, F2: #2, F6: #6, F8: #4, F10: #5

Reservation stations:
Name   Busy  Op     Vj       Vk             Qj   Qk   Dest  A
Load2  yes   L.D    Reg[R3]                           #2    44+Reg[R3]
Mult1  yes   MUL.D           Reg[F4]        #2        #3
Mult2  yes   DIV.D           M[34+Reg[R2]]  #3        #5
(Load1, Add1, Add2, Add3: not busy)

Reorder buffer:
Entry  Busy  Instruction        State      Dest  Value
#1     no    L.D F6, 34(R2)     Committed  F6    M[34+Reg[R2]]
#2     no    L.D F2, 44(R3)     Committed  F2    M[44+Reg[R3]]
#3     yes   MUL.D F0, F2, F4   Execute    F0
#4     yes   SUB.D F8, F2, F6   Ready      F8    #2 - #1
#5     yes   DIV.D F10, F0, F6  Execute    F10
#6     yes   ADD.D F6, F8, F2   Ready      F6    #4 + #2

Register ROB# tags: F0: #3, F6: #6, F8: #4, F10: #5

Reservation stations:
Name   Busy  Op     Vj             Vk             Qj   Qk   Dest
Mult1  yes   MUL.D  M[44+Reg[R3]]  Reg[F4]                  #3
Mult2  yes   DIV.D                 M[34+Reg[R2]]  #3        #5
(all other stations: not busy)

Reorder buffer:
Entry  Busy  Instruction        State         Dest  Value
#1     no    L.D F6, 34(R2)     Committed     F6    M[34+Reg[R2]]
#2     no    L.D F2, 44(R3)     Committed     F2    M[44+Reg[R3]]
#3     yes   MUL.D F0, F2, F4   Write result  F0    #2 x Reg[F4]
#4     yes   SUB.D F8, F2, F6   Ready         F8    #2 - #1
#5     yes   DIV.D F10, F0, F6  Execute       F10
#6     yes   ADD.D F6, F8, F2   Ready         F6    #4 + #2

Register ROB# tags: F0: #3, F6: #6, F8: #4, F10: #5

Reservation stations:
Name   Busy  Op     Vj             Vk             Qj   Qk   Dest
Mult1  yes   MUL.D  M[44+Reg[R3]]  Reg[F4]                  #3
Mult2  yes   DIV.D                 M[34+Reg[R2]]  #3        #5
(all other stations: not busy)

Reorder buffer:
Entry  Busy  Instruction        State      Dest  Value
#1     no    L.D F6, 34(R2)     Committed  F6    M[34+Reg[R2]]
#2     no    L.D F2, 44(R3)     Committed  F2    M[44+Reg[R3]]
#3     no    MUL.D F0, F2, F4   Committed  F0    #2 x Reg[F4]
#4     yes   SUB.D F8, F2, F6   Ready      F8    #2 - #1
#5     yes   DIV.D F10, F0, F6  Execute    F10
#6     yes   ADD.D F6, F8, F2   Ready      F6    #4 + #2

Register ROB# tags: F6: #6, F8: #4, F10: #5

Reservation stations:
Name   Busy  Op     Vj   Vk             Qj   Qk   Dest
Mult2  yes   DIV.D       M[34+Reg[R2]]  #3        #5
(all other stations: not busy)

Reorder buffer:
Entry  Busy  Instruction        State      Dest  Value
#1     no    L.D F6, 34(R2)     Committed  F6    M[34+Reg[R2]]
#2     no    L.D F2, 44(R3)     Committed  F2    M[44+Reg[R3]]
#3     no    MUL.D F0, F2, F4   Committed  F0    #2 x Reg[F4]
#4     no    SUB.D F8, F2, F6   Committed  F8    #2 - #1
#5     yes   DIV.D F10, F0, F6  Execute    F10
#6     yes   ADD.D F6, F8, F2   Ready      F6    #4 + #2

Register ROB# tags: F6: #6, F10: #5

Reservation stations:
Name   Busy  Op     Vj   Vk             Qj   Qk   Dest
Mult2  yes   DIV.D       M[34+Reg[R2]]  #3        #5
(all other stations: not busy)

Arithmetic/Logic Operation Processing
1. Issue: when a reservation station and a ROB entry are available
   - Read already available operands from registers and instruction
   - Tag unavailable operands with ROB entry
   - Tag destination register with ROB entry
   - Write destination register to ROB entry
   - Mark ROB entry as busy
2. Execute: after issue
   - Wait for operand values on CDB (if not already available)
   - Compute result
3. Write result: when CDB and ROB available
   - Send result on CDB to reservation stations
   - Update ROB entry with result, and mark as ready
   - Free reservation station
4. Commit: when at head of ROB and ready
   - Update destination register with result from ROB entry
   - Untag destination register
   - Free ROB entry

Branch Processing
1. Issue: when a reservation station and a ROB entry are available
   - Read already available operands from registers and instruction
   - Tag unavailable operands with ROB entry
   - Write destination address and outcome prediction to ROB entry
   - Mark ROB entry as busy
2. Execute: after issue
   - Wait for operand values on CDB (if not already available)
   - Compute result (branch condition)
3. Write result: when ROB available
   - Update ROB entry with result, and mark as ready
   - Free reservation station
4. Commit: when at head of ROB and ready
   - Update branch predictors with the result
   - If the result did not agree with the prediction: flush ROB, reservation stations, and fetch queue; send the correct next instruction address to the instruction fetch unit
   - Else, free the ROB entry
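The commit step for a branch can be sketched as follows; the entry names, addresses, and the redirect callback are illustrative assumptions.

```python
def commit_branch(younger, predicted, actual, redirect):
    """Commit a branch: `younger` holds the speculative entries behind
    it in the ROB. On a misprediction they are all squashed and the
    fetch unit is steered to the correct target; otherwise they stay."""
    if predicted != actual:
        younger.clear()              # flush all speculative work
        return redirect(actual)      # correct next instruction address
    return None                      # prediction was right: keep going

younger = ["MUL.D", "ADD.D"]         # speculatively issued after the branch
target = commit_branch(younger, predicted=True, actual=False,
                       redirect=lambda taken: 0x400 if taken else 0x404)
assert target == 0x404 and younger == []
```

Updating the predictor and flushing only at commit keeps recovery simple: everything younger than the branch is by construction still uncommitted.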

Load Processing
1. Issue: when a reservation station and a ROB entry are available
   - Read already available operands from registers and instruction
   - Tag unavailable operands with ROB entry
   - Tag destination register with ROB entry
   - Write destination register to ROB entry
   - Mark ROB entry as busy
2. Execute step 1: after issue
   - Wait for the base address register value on CDB (if not already available)
   - Compute address
3. Execute step 2:
   - Wait if a previous store to the same address (or with a yet unknown address) is in the ROB
   - Read result from memory
4. Write result: when CDB and ROB available
   - Send result on CDB to reservation stations
   - Update ROB entry with result, and mark as ready
   - Free reservation station
5. Commit: when at head of ROB and ready
   - Update destination register with result from ROB entry
   - Untag destination register
   - Free ROB entry
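The wait condition in execute step 2 (memory disambiguation against older stores) can be sketched as a simple predicate; the addresses are illustrative.

```python
def load_may_execute(load_addr, older_store_addrs):
    """A load may read memory only when no older store in the ROB
    targets the same address or has a still-unknown address (None)."""
    return all(a is not None and a != load_addr
               for a in older_store_addrs)

assert load_may_execute(0x20, [0x40])        # different address: OK
assert not load_may_execute(0x20, [0x20])    # same address: must wait
assert not load_may_execute(0x20, [None])    # unknown address: must wait
```

The conservative `None` case is why store addresses are computed as early as possible: an unresolved store blocks every younger load.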

Store Processing
1. Issue: when a reservation station and a ROB entry are available
   - Read already available operands from registers and instruction
   - Tag unavailable operands with ROB entry
   - Mark ROB entry as busy
2. Execute: after issue
   - Wait for operand values on CDB (if not already available)
   - Compute the address and store it in the ROB entry
3. Write result: when CDB and ROB available
   - Update ROB entry with source register value, and mark as ready
   - Free reservation station
4. Commit: when at head of ROB and ready
   - Write result (source register value) to memory at the computed address
   - Free ROB entry

Exception Handling
Using a ROB solves the problem of precise exceptions!
- Mark each instruction in the ROB with information about any exceptions it causes
- Do not act on exceptions until commit
- If an exception is detected at commit, treat the instruction (almost) like a mispredicted branch: flush the ROB and fetch queues, and start fetching instructions from the exception handler
Program exception behavior will be preserved!
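A minimal sketch of deferring exceptions to commit, assuming each ROB entry carries an optional exception field; a squashed speculative instruction simply never reaches this point, so it never raises.

```python
def commit_entry(entry, handler_pc):
    """Act on a recorded exception only at commit time. Returns the
    next fetch PC when redirecting to the handler, or None for a
    normal commit that just falls through."""
    if entry.get("exception"):
        return handler_pc            # flush + fetch from the handler
    return None

faulting = {"dest": "F0", "exception": "overflow"}
normal = {"dest": "F2", "exception": None}
assert commit_entry(faulting, handler_pc=0x1000) == 0x1000
assert commit_entry(normal, handler_pc=0x1000) is None
```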

General Processor Organization
[Block diagram: fetch instruction (with branch predictor) -> fetch queue -> get operands & issue -> reservation stations (RS) feeding memory access, integer & logic, and floating point units -> write result on the CDB -> ROB and register file]
Dynamic scheduling with speculative execution

Multiple Issue
[Block diagram: fetch instruction -> get operands & issue -> integer & logic / floating point / memory access units -> update state]
Issue several instructions per cycle

Multiple Issue
- Superscalar: issue a variable number of instructions per cycle, depending on hazards
  - Dynamic superscalar: schedule the instructions dynamically
  - Static superscalar: do not schedule dynamically
- Very Long Instruction Word (VLIW): issue a fixed number of instructions per cycle; rely on static scheduling only

VLIW
Requires wider instructions. Simpler control logic. Difficult to find a sufficient number of instructions to issue. Code becomes hardware dependent.

Dynamic Superscalar with Speculative Execution
[Block diagram: same organization as above: fetch with branch predictor and fetch queue, issue, reservation stations, functional units, CDB, ROB, and register file]
Relatively easy to extend up to 2-4 issues per cycle

Advanced Techniques Branch-Target Buffers (BTB) Return Address Predictors Register Renaming

Branch Target Buffer
Dynamic branch prediction provides the fetch unit with a taken/not-taken prediction.
Branch Target Buffer (BTB):
- Stores the predicted address of the next instruction for taken branches
- Functions as a cache memory that is indexed by the addresses of taken branches
Without a BTB, no instruction can be fetched after a predicted-taken branch until the branch address has been computed => potentially long stalls. With a BTB, the stall time for a correctly predicted branch can become zero (if it hits in the BTB).
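A BTB can be sketched as a small cache mapping branch PCs to predicted targets. The size and the naive eviction policy here are illustrative assumptions; real BTBs index with low PC bits and partial tags.

```python
class BTB:
    """Toy branch target buffer: branch PC -> predicted next-fetch
    address for taken branches."""
    def __init__(self, entries=4):
        self.entries = entries
        self.table = {}                  # pc -> predicted target

    def predict(self, pc):
        # None means no prediction: fetch falls through (or stalls)
        return self.table.get(pc)

    def update(self, pc, target):
        # record a taken branch; evict the oldest entry when full
        if len(self.table) >= self.entries and pc not in self.table:
            self.table.pop(next(iter(self.table)))
        self.table[pc] = target

btb = BTB()
assert btb.predict(0x100) is None    # first encounter: no prediction
btb.update(0x100, 0x180)             # taken branch recorded at resolve
assert btb.predict(0x100) == 0x180   # next fetch redirects with no stall
```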

Return Address Predictor
A BTB does not work well for subroutine returns and indirect jumps, because calls can be made from many different places. Return address predictors can help subroutine returns:
- Push the return address on a small stack when a call is detected
- Works perfectly as long as the stack is deep enough
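The return-address stack can be sketched directly; the depth and the addresses are illustrative.

```python
class ReturnStack:
    """Toy return-address predictor: push on call, pop on return."""
    def __init__(self, depth=8):
        self.depth, self.stack = depth, []

    def call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)           # overflow: oldest entry is lost
        self.stack.append(return_pc)

    def predict_return(self):
        return self.stack.pop() if self.stack else None

ras = ReturnStack()
ras.call(0x104)                         # outer call
ras.call(0x204)                         # nested call
assert ras.predict_return() == 0x204    # inner return predicted first
assert ras.predict_return() == 0x104    # perfect while depth suffices
```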

Register Renaming
The ROB and/or reservation stations in dynamic scheduling provide register renaming: each instruction is provided with a unique location to store its result => WAR and WAW hazards are avoided. When an instruction commits, the result is always written to the register file.
With pure register renaming, used in most high-end modern processors:
- A larger set of physical registers is available than there are architectural registers
- Each instruction is assigned a free physical register
- A mapping is maintained between physical and architectural registers
- Mappings to a physical register are marked speculative until the instruction commits, and then become permanent or are removed
This avoids the intermediate stages of storing results first in reservation stations, then in the ROB, and then in registers.
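Pure register renaming can be sketched as a map from architectural to physical registers plus a free list; the register names and counts are illustrative assumptions.

```python
class RenameMap:
    """Toy rename stage: sources read the current mapping, each new
    destination gets a fresh physical register from the free list."""
    def __init__(self, arch_regs, num_phys):
        self.map = {r: i for i, r in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_phys))

    def rename(self, dest, srcs):
        phys_srcs = [self.map[s] for s in srcs]  # read old mappings
        new_dest = self.free.pop(0)              # fresh physical reg
        self.map[dest] = new_dest                # later readers see it
        return new_dest, phys_srcs

rm = RenameMap(["F0", "F2", "F4"], num_phys=8)
d1, s1 = rm.rename("F0", ["F2", "F4"])   # e.g. MUL.D F0,F2,F4
d2, s2 = rm.rename("F0", ["F2", "F4"])   # a later writer of F0
assert d1 != d2          # WAW hazard gone: two different destinations
assert s1 == s2          # both read the same source values
```

Freeing a physical register once no consumer can still read it (and making speculative mappings permanent at commit) is omitted here for brevity.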

Summary: Fetch Bottleneck
- Control hazards: dynamic branch prediction (predict the outcome of branches and jumps); branch target buffers
- Memory bottleneck: memory performance improvement (memory hierarchy, prefetch, ...)

Summary: Issue Bottleneck
- RAW hazards: dynamic scheduling (out-of-order execution)
- WAR & WAW hazards: remove name dependences (register renaming)
- Structural hazards: dynamic scheduling (out-of-order execution); memory performance improvement (memory hierarchy, prefetch, non-blocking caches, load/store queues); multiple and pipelined functional units
- Control hazards: speculative execution
- Single issue: issue multiple instructions per cycle (superscalar, VLIW)

Summary
In the next lecture (lecture 4) we will look at what can be done about memory performance (Chapter 5, Sections 5.1-5.2).