EE 457 Unit 9a. Exploiting ILP Out-of-Order Execution

Similar documents
Credits. EE 457 Unit 9a. Outline. 8-Stage Pipeline. Exploiting ILP Out-of-Order Execution

References EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions)

Handout 2 ILP: Part B

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Lecture-13 (ROB and Multi-threading) CS422-Spring

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions

5008: Computer Architecture

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING

E0-243: Computer Architecture

CISC 662 Graduate Computer Architecture. Lecture 10 - ILP 3

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Instruction-Level Parallelism and Its Exploitation

Copyright 2012, Elsevier Inc. All rights reserved.

COSC4201 Instruction Level Parallelism Dynamic Scheduling

ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism

Processor: Superscalars Dynamic Scheduling

COSC 6385 Computer Architecture - Pipelining

LECTURE 10. Pipelining: Advanced ILP

Static vs. Dynamic Scheduling

Chapter 4 The Processor 1. Chapter 4D. The Processor

Adapted from David Patterson s slides on graduate computer architecture

COMPUTER ORGANIZATION AND DESIGN

CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

EE557--FALL 1999 MAKE-UP MIDTERM 1. Closed books, closed notes

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

CS433 Midterm. Prof Josep Torrellas. October 16, Time: 1 hour + 15 minutes

EECC551 Exam Review 4 questions out of 6 questions

Multi-cycle Instructions in the Pipeline (Floating Point)

Chapter 4 The Processor 1. Chapter 4B. The Processor

Multiple Instruction Issue. Superscalars

Out of Order Processing

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Three Dynamic Scheduling Methods

Predict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch

Advanced issues in pipelining

Hardware-Based Speculation

Instruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction

Hardware-based Speculation

Page # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer

CAD for VLSI 2 Pro ject - Superscalar Processor Implementation

Instruction Level Parallelism

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar

Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar, VLIW. CPE 631 Session 19 Exploiting ILP with SW Approaches

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)

Course on Advanced Computer Architectures

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS

CPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation

Four Steps of Speculative Tomasulo cycle 0

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer

Chapter 3: Instruction Level Parallelism (ILP) and its exploitation. Types of dependences

Chapter 4 The Processor 1. Chapter 4A. The Processor

Instruction Pipelining Review

Thomas Polzer Institut für Technische Informatik

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Two Dynamic Scheduling Methods

CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming

Advanced Computer Architecture

Advanced Computer Architecture

The Processor Pipeline. Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes.

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

DYNAMIC INSTRUCTION SCHEDULING WITH SCOREBOARD

Super Scalar. Kalyan Basu March 21,

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

Scoreboard information (3 tables) Four stages of scoreboard control

1 Hazards COMP2611 Fall 2015 Pipelined Processor

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

The basic structure of a MIPS floating-point unit

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

EEC 581 Computer Architecture. Lec 7 Instruction Level Parallelism (2.6 Hardware-based Speculation and 2.7 Static Scheduling/VLIW)

Exploitation of instruction level parallelism

Lecture 9. Pipeline Hazards. Christos Kozyrakis Stanford University

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

EE 457 Midterm Summer 14 Redekopp Name: Closed Book / 105 minutes No CALCULATORS Score: / 100

Lecture 19: Instruction Level Parallelism

EEC 581 Computer Architecture. Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW)

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

Computer Architecture Spring 2016

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

Spring 2010 Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic

CS252 Graduate Computer Architecture Lecture 6. Recall: Software Pipelining Example

Advanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012

COSC121: Computer Systems. ISA and Performance

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Transcription:

1 EE 457 Unit 9a Exploiting ILP Out-of-Order Execution

2 Credits Some of the material in this presentation is taken from: Computer Architecture: A Quantitative Approach John Hennessy & David Patterson Some of the material in this presentation is derived from course notes and slides from Prof. Michel Dubois (USC) Prof. Murali Annavaram (USC) Prof. David Patterson (UC Berkeley)

3 Outline Instruction Level Parallelism In-order pipeline From academic 5-stage pipeline To 8-stage MIPS R4000 pipeline Superscalar, superpipelined Out-of-Order Execution Out-of-order completion (Problem: Exceptions) In-order completion Thread Level Parallelism Chip multithreading (CMT) Chip multiprocessors (CMP)

4 8-Stage Pipeline In 5-stage pipeline, we place the I-Cache and D-Cache in one stage each When we add VM translation (i.e. TLB), cache tag check, and data access we need more stages to balance the delay Single Memory Stage in 5- Stage Pipeline Expand to three stages when TLB and Cache access is balanced D-Cache D-TLB Cache Data Cache Tag Lookup Tag Check

5 8-Stage Pipeline R4000 uses pipelined instruction and data caches The instruction is available at the end of the IS stage but the tag check is done in RF This is harmless since it s fine to start the decode and read registers even if it s an invalid tag so long as we know by the end of the clock For the data cache we must perform the tag check before we can start writing into the register file TLB Data/Tag Access Tag Check TLB Data/Tag Access Tag Check IF IS RF EX DF DS TC WB IM Reg ALU DM Reg

6 Load Dependency Issue Latency LW,0() 8-stage pipeline has a two-cycle load delay Possible since data is available at end of DS and can be forwarded If tag check indicates a miss the pipeline can be backed up (re-execute the instruction when the data is available IM Reg ALU DM Reg IM Reg ALU DM Reg IM Reg ALU DM Reg ADD,, IM Reg ALU DM Reg

SUPERSCALAR & SUPERPIPELINING 7

Superpipelining Superscalar 8 Overview Superscalar = More than 1 instruction completing per clock cycle 2-way superscalar = Proc. that can issue 2 instructions per clock cycle Success is sensitive to ability to find independent instructions to issue in the same cycle Superpipelining = Many small stages to boost clock freq. Success depends of finding instructions to schedule in the shadow of data and control hazards Instruction 1 Instruc. Fetch Instruc. Decode Execute Data Memory Write back Instruction 2 Instruc. Fetch Instruc. Decode Execute Data Memory Write back Superscalar: Executing more than 1 instruction per clock cycle (CPI < 1) Instruction 1 IF1 IF2 ID EX DM1 DM2 DM3 WB Instruction 2 IF1 IF2 ID EX DM1 DM2 DM3 WB Superpipelining: Divide logic into many short stages (Higher Clock Frequency)

LD/ST Slot Integer Slot 9 2-way Superscalar One ALU & Data transfer (LW/SW) instruction can be issued at the same time Instruction Pipeline Stages ALU or branch IF ID EX MEM WB LW/SW IF ID EX MEM WB ALU or branch IF ID EX MEM WB LW/SW IF ID EX MEM WB ALU or branch IF ID EX MEM WB LW/SW IF ID EX MEM WB PC I-Cache 2 instructions Reg. File (4 Read, 2 Write) ALU Addr. Calc. D-Cache

OUT-OF-ORDER EXECUTION 10

11 Instruction Level Parallelism (ILP) Although a program defines a sequential ordering of instructions, in reality many instructions can be executed in parallel. ILP refers to the process of finding instructions from a single program/thread of execution that can be executed in parallel Data flow (data dependencies) is what truly limits ordering Independent instructions (no data dependencies) can be executed at the same time) Control hazards also provide some ordering constraints Program Order (In-order) lw $s3,0($s4) and $t3,$t2,$t3 add $t0,$t0,$s4 or $t5,$t3,$t2 sub $t1,$t1,$t2 beq $t0,$t8,l1 xor $s0,$t1,$s2 Dependency Graph We may perform execution out-of-order LW ADD SUB AND BEQ XOR OR Cycle 1: Cycle 2: Cycle 3: lw $s3,0($s4) / add $t0,$t0,$s4 / sub $t1,$t1,$t2 / and $t3,$t2,$t3 / beq $t0,$t8,l1 / / or $t5,$t3,$t2 / / xor $s0,$t1,$s2 /

12 Out-of-Order Motivation Hide the impact of dynamic events such as a cache miss Out-of-Order (OoO) Execution Let independent instructions behind a stalled instruction execute Important aspect: Completion Ordering Out-of-Order completion: Let the independent instruction that has been executed update state (registers + memory) before the stalled instruction completes Problem: Exception handling In-Order completion: Let the independent instructions execute but buffer their results until the stalled instruction completes

13 Out-of-Order Execution Io = In-order Execution Execution here means producing the results not necessarily writing them to a register or memory Completion means committing/writing the results to register file or memory While we say out-of-order execution we really mean: IoI (IoD) => OoOE => IoC IoI / IoD = In-order Issue/Dispatch IoC = In-order completion

14 IoC or OoC? IoI (IoD) => OoOE => IoC In-order completion is necessary to support exceptions [exact state at time of exception] We will present the concept of OoOC (out-of-order completion) then roll back to IoC OoOC Issues Brancheswe should not commit an instruction that came after (in program order) a branch Solution: Stall dispatching instructions after a branch until we resolve the outcome LW,0() // cache miss BEQ,$0,L1 ADD,, // What if we execute this ADD out of order

15 Basic Blocks Basic Block (def.) = Sequence of instructions that will always be executed together No conditional branches out No branch targets coming in Also called straight-line code Average size: 5-7 instrucs. lw $s3,0($s4) and $t3,$t2,$t3 L1: add $t0,$t0,$s4 or $t5,$t3,$t2 sub $t1,$t1,$t2 beq $t0,$t8,l1 xor $s0,$t1,$s2 Instructions in a basic block can be overlapped if there are no data dependencies Control dependences really limit our window of possible instructions to overlap W/o extra hardware, we can only overlap execution of instructions within a basic block This is a basic block (starts w/ target, ends with branch)

16 Scheduling Strategies Static Scheduling Compiler re-orders instructions in such a way that no dependencies will be violated and allows for OoOE Dynamic Scheduling HW implementing the Tomasulo algorithm or other similar approach will re-order instructions to allow for OoOE More Advanced Concepts Branch prediction and speculative execution (execution beyond a branch flushing if incorrect) will be covered later

17 Static Scheduling Strengths Hardware simplicity [Better clock rate] Power/energy advantage Compiler has a global view of the program anyway, so it should be able to do a good job Very predictable: static performance predictions are reliable Weaknesses Requires re-compilation to take advantage of new/modified architecture Cannot foresee dynamic (data-dependent) events Cache miss, conditional branches (can only recedule instructions in a basic block) Cannot precompute memory addresses No good solution for precise exceptions with out-of-order completion

18 Where to Stall? In 5-stage pipeline (in-order execution) RAW dependency was solved by Forwarding (preferably) or Stalling Dependent instructions stalled in the ID stage if necessary ADD,, stall LW IM Reg ALU DM Reg

Prior ALU Result Data Mem. or ALU result Instruction Register Pipeline Stage Register Pipeline Stage Register D-Cache PC I-Cache Pipeline Stage Register Mem WB WB 19 Where to Stall? Simple 5-stage pipeline: Dependent instructions cannot be stalled in the EX stage Why? What if ADD was also dependent on the instruction in WB ADD has no place to buffer that forwarded value Thus we stall in ID so we can use the Register File to grab dependent values. Further stalling in ID incurs only 1 cycle penalty as would stalling in EX. PCWrite IRWrite 4 + rs rt. 5 5 IF.Flush HDU Control Read Reg. 1 # Read Reg. 2 # Write Reg. # Write Data Stall Read data 1 Read data 2 Register File Ex Mem WB 0 1 2 0 1 2 ALUSelA FLUSH 0 1 Sh. Left 2 0 0 ALU + 0 1 0 1 Zero Res. Branch 0 1 MemRead & MemWrite 0 1 MemToReg Reset Sign Extend 16 32 rs rt rd ALUSelB Forwarding Unit ALUSrc 0 1 Regwrite & WriteReg# Regwrite, WriteReg#

20 Where to Stall? But to implement OoO execution, we cannot stall in the decode stage since that would prevent any further issuing of instructions Thus, now we will issue to queues for each of the multiple functional units and have the instruction stall in the queue until it is ready Queues + Functional Units ALU MUL IM Reg DM Reg DIV Stalling here would plug up the pipeline DM (Cache)

Prior ALU Result Data Mem. or ALU result Instruction Register Pipeline Stage Register Pipeline Stage Register D-Cache PC I-Cache Pipeline Stage Register Mem WB WB 21 Forwarding in OoO Execution In 5-stage pipeline junior instructions carried their source register IDs into the EX stage to be compared with destination register ID s of their seniors (earlier instructions) But in OoO execution, we may have many instructions (seniors) in front of us and cannot afford to perform so many comparisons 0 (as well as handling the 1 FLUSH PCWrite case where many seniors are producing new version of a register) 0 Stall Instead, the IF.Flush dispatch unit will explicitly tell the dependent instruction who to 0 0 1 receive data Control from Branch 4 +. IRWrite rs rt 5 5 HDU Read Reg. 1 # Read Reg. 2 # Write Reg. # Write Data Read data 1 Read data 2 Register File Ex Mem WB 0 1 2 0 1 2 ALUSelA 0 1 Sh. Left 2 ALU + 0 1 Zero Res. MemRead & MemWrite 0 1 MemToReg Reset Sign Extend 16 32 rs rt rd ALUSelB Forwarding Unit ALUSrc 0 1 Regwrite & WriteReg# Regwrite, WriteReg#

22 Tomasulo s Plan OoO Execution Multiple functional units Integer ALU, Data memory, Multiplier, Divider Queues between ID and EX stages (in place of ID/EX register) Allows later instructions to keep issuing even if earlier ones are stalled

23 OoO Execution Problems For the time No branch prediction No speculative execution beyond a branch So we simply stall on a conditional branch For the time, no support for precise exceptions Even then

24 RAW, WAR, and WAW RAW = Read After Write lw, 40() add $9,, WAR = Write After Read add $9,, say is not available yet lw, 40() WAW = Write After Write add $9,, say is not available yet lw $9, 40() Why would anyone produce one result in $9 without utilizing that result? Why would he overwrite it with another result? How is this possible?

25 WAW can easily occur How is it possible? In OoO execution, instructions before the branch and after the branch can co-exist Consider multiple iterations of a loop body for(i=max; i!= 0; i--) A[i] = A[i] * 3; L1: lw, 40() mult,, sw, 40() addi,,-4 bne, $0,L1 Original Code L1: lw, 40() mult,, sw, 40() addi,,-4 bne, $0,L1 L1: lw, 40() mult,, sw, 40() addi,,-4 bne, $0,L1

26 Another WAW Example Say a company gives standard bonus to most of the employees and a higher bonus to managers The software may set a default value to the standard bonus and then overwrite for the special case int x = standard_bonus; if (manager) x = special_bonus; set_bonus(x);

Name Depdencies 27 RAW, WAR, and WAW Some terminology to remember RAW = Read After Write lw, 40() add $9,, WAR = Write After Read add $9,, lw, 40() WAW = Write After Write add $9,, lw $9, 40() RAW A true dependency WAR An anti-dependency WAW An anti-dependency Note: no information is communicated in WAR/WAW hazards

28 RAW, WAR, and WAW In-order execution: We need to deal with RAW only Out-of-order execution Now we need to deal with WAR and WAW hazards besides RAW Any of these hazards seem to prevent re-ordering instructions and executing them out-of-order

29 Register Renaming We have limited architectural registers Registers the instruction set is aware of We could have more physical registers Actual registers part of the register file Assume Delayed lw, 40() sw, 40() lw, 60() sw, 60() It is clear the compiler is using as a temporary register If there is a delay in obtaining the first part of the code cannot proceed Unfortunately, the second part of the code cannot proceed because of the name dependency for

30 Register Renaming If we had 64 registers instead of 32 registers, then perhaps the compiler might have used 8 instead of and we could have executed the second part of the code before the first part lw, 40() sw, 40() lw 8, 60() add 8, 8, 8 sw 8, 60() This is an example of a name-dependency

31 Register Renaming Four different temporary registers can be used here as shown:, 8, 8 and 8 We could even simply use aliases or code names: LION, TIGER, CAT, BEAR lw, 40() add 8,, sw 8, 40() lw LION, 40() add TIGER, LION, LION sw TIGER, 40() lw 8, 60() add 8, 8, 8 sw 8, 60() lw CAT, 60() add BEAR, CAT, CAT sw BEAR, 60()

32 Increasing Number of Registers Can a later implementation provide 64 registers (instead of 32) while maintaining binary compatibility with previously compiled code? Answer: Yes / No NO Why? Machine code has 5-bit fields for register ID s

33 Increasing Number of Registers Answer: Cannot change the number of architectural registers Instead we will perform Register Renaming through Tagging Registers This solves name dependency problems (WAR and WAW) while attending to true dependency (RAW) through waiting in queues Please be sure you understand this!

34 Tagging process sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST RF 1 1 Issue Logic RST = Register Status Table RF = Register File INT ALU T1: SQRT Val / 0 Val INT MUL/DIV/SQRT Load/ Store

35 Tagging process: CC1 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 RF 1 1 Issue Logic RST = Register Status Table RF = Register File INT ALU T1: SQRT Val / 0 Val INT MUL/DIV/SQRT Load/ Store

36 Tagging process: CC2 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 T2 RF 1 1 Issue Logic RST = Register Status Table RF = Register File INT ALU T1: SQRT Val / 0 Val T2: LW T1 / 40 INT MUL/DIV/SQRT Load/ Store

37 Tagging process: CC3 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 T2 T3 RF 1 1 Issue Logic RST = Register Status Table RF = Register File T3: ADD T2 / T2 T1: SQRT Val / 0 Val T2: LW T1 / 40 INT ALU INT MUL/DIV/SQRT Load/ Store

38 Tagging process: CC4 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 T3 RF 1 1 Issue Logic RST = Register Status Table RF = Register File T3: ADD T2 / T2 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT SW T3 / T1 / 40 T2: LW T1 / 40 Load/ Store

39 Tagging process: CC5 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 T3 T4 RF 1 1 Issue Logic RST = Register Status Table RF = Register File T3: ADD T2 / T2 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT T4: LW val / 60 SW T3 / T1 / 40 T2: LW T1 / 40 Load/ Store

40 Tagging process: CC6 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 T4 T5 RF 1 1 Issue Logic RST = Register Status Table RF = Register File T5: ADD T4 / T4 T3: ADD T2 / T2 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT T4: LW val / 60 SW T3 / T1 / 40 T2: LW T1 / 40 Load/ Store

41 Tagging process: CC7 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 T5 RF 1 1 Issue Logic RST = Register Status Table RF = Register File T5: ADD T4 / T4 T3: ADD T2 / T2 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT T4: LW val / 60 SW T3 / T1 / 40 T2: LW T1 / 40 Load/ Store T4: Read 0x1111

42 Tagging process: CC8 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 T5 => null RF 0x2222 1 1 T5: Sum 0x2222 Issue Logic RST = Register Status Table RF = Register File T5: ADD 0x1111 / 0x1111 T3: ADD T2 / T2 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT SW T3 / T1 / 40 T2: LW T1 / 40 Load/ Store T5: Sum 0x2222

43 Tagging process: CC9 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 null RF 0x2222 1 1 Issue Logic RST = Register Status Table RF = Register File T3: ADD T2 / T2 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT SW 0x2222, val / 60 SW T3 / T1 / 40 T2: LW T1 / 40 Load/ Store T5: Sum 0x2222

44 Tagging process: CC10 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST T1 => null RF 0x2222 1 1 T1: SQRT 0xacd0 Issue Logic RST = Register Status Table RF = Register File T3: ADD T2 / T2 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT SW T3 / T1 / 40 T2: LW T1 / 40 Load/ Store T1: SQRT 0xacd0

45 Tagging process: CC11 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST RF 0xacd0 0x2222 1 1 Issue Logic RST = Register Status Table RF = Register File T3: ADD T2 / T2 INT ALU INT MUL/DIV/SQRT SW T3 / 0xacd0 / 40 T2: LW 0xacd0 / 40 Load/ Store T2: Read 0x5678

46 Tagging process: CC12 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST RF 0xacd0 0x2222 1 1 Issue Logic RST = Register Status Table RF = Register File T3: ADD 0x5678 / 0x5678 INT ALU INT MUL/DIV/SQRT SW T3 / 0xacd0 / 40 Load/ Store T3: Sum 0xACF0

47 Tagging process: CC13 sqrt, 0 lw, 40() sw, 40() lw, 60() sw, 60() RST RF 0xacd0 0x2222 1 1 Issue Logic RST = Register Status Table RF = Register File SW 0xacf0 / 0xacd0 / 40 INT ALU INT MUL/DIV/SQRT Load/ Store

48 Tomasulo s Algorithm Dispatch/Issue unit decodes and dispatches instructions For destination operand, an instruction carries a TAG (but not the actual register name) For source operands, an instruction carries either the values (if TAG is null in RST) or TAGs of the operands (but not the actual register name) When an instruction executes and produces a result it broadcasts the result and its destination TAG Any instruction waiting can compare its SRC tags with the destination tag and grab the value if they match If entry in RST matches the TAG then this instruction is the latest producer of the register and the value will be written to the RF

49 Register Renaming sqrt, 0 add,, add,, add,, add,, RST T1, T2, T3, T4 RF 1 1 Issue Logic RST = Register Status Table RF = Register File T4: ADD T3 / T3 T3: ADD T2 / T2 T2: ADD T1 / T1 T1: SQRT Val / 0 Val INT ALU INT MUL/DIV/SQRT Load/ Store

50 Unique TAGs Like SSN, we need a unique TAG SSN s are reused. Similarly TAGS can be reused TAGs are similar to number TOKEN Helps to create a virtual queue. We do not need that here In State Bank of India, the cashier issues brass token to customers trying to draw money as an ID (and not at all to put them in any virtual queue / ordering). Token numbers are in random order. The cashier verifies the signature in the record rooms, returns with money, calls the token number and issues the money. Tokens are reclaimed & reused.

51 Tags (= Tokens) How many tokens should the bank casheir have to start with? What happens if the tokens run out? Does the cashier need to have any order in holding tokens and issuing tokens? Do they have to collect the tokens back?

52 TAG FIFO To issue and collect tokens (TAGS) use a circular FIFO (First- In/First-Out) unit While the FIFO order is not important here, a FIFO is the easiest to implement in hardware compared to a random order in a pile Filled (with say) 64 tokens (in any order) initially on reset Tokens return in any order FIFO s are taught in EE 560 Put tokens back in the FIFO and reissue TAG FIFO TAG FIFO TAG FIFO wp 0 1 2 63 rp wp 2 63 rp wp 1 2 63 rp FULL 2 Tokens issued 1 Tokens returned

Int. Queue L/S Queue Div Queue Mult. Queue Reg. File 53 Organization for OoO Execution I-Cache Instruc. Queue TAG FIFO Block Diagram Adapted from Prof. Michel Dubois (Simplified for EE 457) Register Status Table Dispatch Integer / Branch D-Cache Div Mul Issue Unit CDB

54 Front-End & Back-End IFQ (Instruction Fetch Queue) A FIFO structure Dispatch (Issue) Unit Includes RST, RF, Tag FIFO Load/Store and other Issue Queues Issue Units Functional units CDB (Common Data Bus) Like a public address system that everyone can see/hear when data is produced

55 More Tomasulo Algorithm Front End Instructions are fetched They are stored in a FIFO (IFQ) When instruction reached the head of the IFQ it is Decoded Dispatched to an issue queue/functional unit Even if some of the inputs are not ready (takes TAGs) Back End Instructions in issue queues wait for their input operands Once register operands are ready instructions can be scheduled for execution provided they will not conflict for the CDB or their functional unit Instructions execute in their functional unit and their result is put on the CDB All instructions in queues and the register file watch the CDB and grab the value they are waiting for when it is produced

56 Bottleneck in Tomasulo s Approach The CDB!!! Do all instructions use the CDB? No, not SW, J (jump), BEQ

57 Load/Store Queue (LSQ) Performs Address calculation Memory disambiguation RAW, WAR, WAW hazards due to memory reads and writes // Is there a dependency here? SW,0() LW,0() // What about here? SW, 1000() LW, 0()

58 Address Calculation for LW/SW EE 557 approach for address calculation Loads & store in 2 sub-instructions 1 instruction computes address and is dispatched to integer ALU 1 instruction access data cache and is issued to LSQ Address is communicated from integer ALU to LSQ via CDB forwarding using a tag EE 560/457 approach Use a dedicated adder in the LSQ to compute address (so just 1 dispatched instruction)

59 Memory Disambiguation EE 557: Issue to a cache from LSQ Loads can issue to a cache when their address is ready Stores can issue to cache when both address & data is ready Memory hazards (RAW, WAR, WAW) are resolved in the LSQ Load can issue to cache if no store with same address is before it Store can issue to cache if no store or load with same address before it Otherwise access waits in LSQ If an address is unknown it is assumed to be the same Worst case to enforce correctness The process of figuring out and comparing memory address is called disambiguation

60 Memory Disambiguation RAW sw, 2000($0) lw, 2000($0) This later lw can proceed only if there is no store ahead of it with the same address WAW sw, 2000($0) sw, 2000($0) This later sw can proceed only if there is no store ahead of it with the same address WAR lw, 2000($0) sw, 2000($0) This later sw can proceed only if there is no load ahead of it with the same address

61 LSQ Ordering Maintaining instructions in the order of arrival Issue order/program order in a queue Is this necessary and/or desirable? In the case of LSQ? Necessary! To enforce memory disambiguation In the case of Integer, MUL, DIV queues? Desirable, so that an earlier instruction gets executed whenever possible, thereby reducing queue pressure from too many instructions waiting on it

62 Priority Priority (based on the order of arrival among ready instructions) Is it necessary or is it desirable? Local priority within queues? Global priority across the queues?

63 Issue Unit CDB availability constraint Pipelined functional unit vs. multicycle unit Conflict resolution Round-robin priority adequate?, well,

64 Conditional Branches Dispatch unit stops dispatching until the branch is resolved CDB broadcasts the result of the branch Dispatching continues at the target or fall-through instruction Successful branch shall cause flushing of IFQ Since we stop dispatching instructions after a branch, does it mean that this branch is the last instruction to be executed in the back-end? Is it possible that the back-end holds simultaneously A. Some instructions dispatched before the branch and B. Some instructions issued after the branch

65 Structural & Control Hazards Structural Stalls I-Fetch must stall if the IFQ is full Dispatch must stall if all entries in the desired functional unit s Issue Queue is occupied Instructions cannot be scheduled in case of conflicts for the CDB or functional unit Control Dispatcher stalls when it reaches a branch Branches are dispatched to integer queue They wait for their operands if needed Put their result on CDB If untaken, dispatch unit resumes If taken, then dispatch clears the IFQ and resumes at target Precise exceptions not supported

BACKUP 66

67 Tagging Registers: CC1 DOG sqrt, 0 lw, 40() Orange means dispatched and SQRT is a long-latency computation RST DOG RF sw, 40() 1 1 lw, 60() sw, 60() Destination Dependent source RST = Register Status Table RF = Register File

68 Tagging Registers: CC2 sqrt, 0 LION DOG lw, 40() DOG Orange means dispatched and SQRT is a long-latency computation RST DOG LION RF sw, 40() 1 1 lw, 60() sw, 60() Destination Dependent source RST = Register Status Table RF = Register File

69 Tagging Registers: CC3 sqrt, 0 LION lw, 40() TIGER DOG LION DOG LION Orange means dispatched and SQRT is a long-latency computation RST DOG TIGER RF sw, 40() 1 1 lw, 60() sw, 60() Destination Dependent source RST = Register Status Table RF = Register File

70 Tagging Registers: CC4 sqrt, 0 LION lw, 40() TIGER DOG LION DOG LION Orange means dispatched and SQRT is a long-latency computation RST DOG TIGER RF TIGER sw, 40() 1 1 lw, 60() sw, 60() Destination Dependent source RST = Register Status Table RF = Register File

71 Tagging Registers Review sqrt, 0 LION lw, 40() TIGER TIGER DOG LION sw, 40() lw, 60() sw, 60() DOG LION Dispatch unit decodes and dispatches instructions For destination operand, an instruction carreis a TAG (but not the actual register name) For source operands, an instruction carries either the values (if no TAG in RST) or TAGs of the operands (but not the actual register name) When 1 RST DOG TIGER 1 RF

Int. Queue L/S Queue Div Queue Mult. Queue Reg. File 72 Organization for OoO Execution I-Cache Instruc. Queue TAG FIFO Block Diagram Adapted from Prof. Michel Dubois (Simplified for EE 457) Register Status Table Dispatch Integer / Branch D-Cache Div Mul Issue Unit CDB

73 Multiple Functional Units We now provide multiple functional units After decode, issue to a queue, stalling if the unit is busy or waiting for data dependency to resolve Queues + Functional Units ALU MUL IM Reg DM Reg DIV DM (Cache)

74 Functional Unit Latencies Int. ALU, Addr. Calc. EX An added complication of out-of-order execution & completion: WAW & WAR hazards FP Add A1 A2 A3 A4 Int. & FP MUL M1 M2 M3 M4 M5 M6 M7 Int. & FP DIV Look Ahead: Tomasulo Algorithm will help absorb latency of different functional units and cache miss latency by allowing other ready instruction proceed out-of-order Functional Unit Latency (Required stalls cycles between dependent [RAW] instrucs.) Initiation Interval (Distance between 2 independent instructions requiring the same FU) Integer ALU 0 1 FP Add 3 1 FP Mul. 6 1 FP Div. 24 25

Int. Queue L/S Queue Div Queue Mult. Queue Reg. File 75 OoO Execution w/ ROB ROB allows for OoO execution but in-order completion I-Cache D-Cache Instruc. Queue ROB (Reorder Buffer) Br. Pred. Buffer Dispatch Exceptions? No problem Integer / Branch Exec. Unit D-Cache L/S Buffer Addr. Buffer Div Mul Issue Unit CDB