Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK

Cycle 62 (Scoreboard) vs. cycle 57 (Tomasulo)

Instruction status:

                          Scoreboard                          Tomasulo
  Instruction         Issue  Read   Exec   Write      Issue  Exec    Write
                             Oper   Comp   Result            Compl   Result
  LD    F6  34+ R2      1      2      3      4          1      3       4
  LD    F2  45+ R3      5      6      7      8          2      4       5
  MULTD F0  F2  F4      6      9     19     20          3     15      16
  SUBD  F8  F6  F2      7      9     11     12          4      7       8
  DIVD  F10 F0  F6      8     21     61     62          5     56      57
  ADDD  F6  F8  F2     13     14     16     22          6     10      11

Tomasulo vs. Scoreboard

                      Tomasulo                            Scoreboard
  Functional units    Pipelined (6 load, 3 store,         Multiple (1 load/store,
                      3 add, 2 mul/div)                   1 add, 2 mul, 1 div)
  Window size         14 instructions                     5 instructions
  Structural hazard   No issue on structural hazard (same for both)
  WAR                 Renaming avoids stall               Stall completion
  WAW                 Renaming avoids stall               Stall issue
  Results             Broadcast from FU                   Write/read registers
  Control             Reservation stations (distributed)  Central scoreboard

Register Renaming + Scoreboard?

Block diagram: a rename table in front of a central SCOREBOARD, which controls the registers, memory, and the functional units (integer unit, FP add, FP divide, and two FP multipliers).

Explicit Register Renaming

- Make use of a physical register file that is larger than the number of registers specified by the ISA.
- Keep a translation table: an ISA register => physical register mapping.
- A physical register becomes free when it is no longer being used by any instruction in progress.

Pipeline diagram: Fetch -> Decode/Rename -> Execute, with the rename table consulted at the Decode/Rename stage.
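To make the mechanism concrete, here is a minimal sketch of a rename table with a free list. The class and constant names (`RenameTable`, `NUM_PHYS_REGS`, etc.) are illustrative, not from the lecture; real cores free the old physical register only after all in-flight readers are done.

```python
# Minimal register-renaming sketch (illustrative names, not from the lecture).
NUM_ISA_REGS = 32
NUM_PHYS_REGS = 64   # assumption: physical file is larger than the ISA file


class RenameTable:
    def __init__(self):
        # Initially, ISA register i maps to physical register i.
        self.map = list(range(NUM_ISA_REGS))
        # The remaining physical registers start out on the free list.
        self.free = list(range(NUM_ISA_REGS, NUM_PHYS_REGS))

    def lookup_src(self, isa_reg):
        """Source operands read the current mapping."""
        return self.map[isa_reg]

    def rename_dest(self, isa_reg):
        """Allocate a fresh physical register for a destination.

        Giving every writer a fresh name is what removes WAW/WAR hazards.
        A real core would stall issue when the free list is empty."""
        if not self.free:
            raise RuntimeError("no free physical registers: stall on issue")
        old_phys = self.map[isa_reg]
        new_phys = self.free.pop()
        self.map[isa_reg] = new_phys
        # old_phys is returned so it can be freed once no in-flight
        # instruction still reads it.
        return new_phys, old_phys
```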

Explicit Register Renaming requires:

- Rapid access to a table of translations.
- A physical register file that has more registers than specified by the ISA.
- The ability to figure out which physical registers are free. No free registers => stall on issue.

Many modern architectures use explicit register renaming plus Tomasulo-like reservation stations to control execution.

Relationship between #registers and #RS entries: Piazza +1.

What About Exceptions and Interrupts?

- Both the scoreboard and Tomasulo have in-order issue, out-of-order execution, and out-of-order completion.
- Recall: an interrupt or exception is precise if there is a single instruction for which:
  - all instructions before it have committed their state, and
  - no following instruction (including the interrupting instruction) has modified any state.
- We need a way to resynchronize execution with the instruction stream (i.e., with issue order).

Re-order Buffer

- Out-of-order completion: what about precise exceptions? It would be great to have in-order commit with out-of-order (O3) execution.
- The process of an instruction leaving the ROB will now be called commit.
- To preserve precise exceptions, a result is written into the register file only when the instruction commits.
- Until then, the result is saved in a temporary register in the ROB.

Re-order Buffer (ROB)

Block diagram: the ROB sits between the FP op queue and the FP registers. Each entry carries a destination register, result value, valid bit, exceptions flag, and program counter; a compare network matches broadcast results against the reorder table, and reservation stations feed the FP adders.

Speculative Tomasulo with ROB

1. Issue: get an instruction from the FP op queue. If a reservation station and a reorder-buffer slot are free, issue the instruction and send its operands and the reorder-buffer number for its destination (this stage is sometimes called "dispatch").
2. Execution: operate on operands (EX). When both operands are ready, execute; if not ready, watch the CDB for the result. This stage checks for RAW hazards (it is sometimes called "issue").
3. Write result: finish execution (WB). Write the result on the Common Data Bus to all awaiting FUs and to the reorder buffer; mark the reservation station available.
4. Commit: update the register with the reorder-buffer result. When an instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer. A mispredicted branch flushes the reorder buffer (this stage is sometimes called "graduation").
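A minimal sketch of step 4, the part that is new relative to plain Tomasulo: commit happens strictly from the head of the ROB, so register and memory state are updated in program order. This assumes a deque-based ROB whose entries are plain dicts; all names here are illustrative.

```python
# Sketch of the commit step only (illustrative; entries are plain dicts).
from collections import deque

rob = deque()  # head (leftmost) = oldest in-flight instruction


def commit_cycle(regfile, memory):
    """Retire completed instructions strictly from the ROB head."""
    while rob and rob[0]["done"]:
        e = rob.popleft()
        if e.get("mispredict"):
            # A mispredicted branch reaching the head flushes every
            # younger (speculative) entry.
            rob.clear()
            break
        if e["kind"] == "store":
            memory[e["addr"]] = e["value"]    # memory written only at commit
        elif e["dest"] is not None:
            regfile[e["dest"]] = e["value"]   # registers written only at commit
```

Because architectural state changes only in this loop, an exception raised by any in-flight instruction can simply mark its ROB entry; when the entry reaches the head, everything older has committed and nothing younger has touched state, which is exactly the precise-exception condition from the previous slide.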

ROB Entry

Each ROB entry holds:
- the type of instruction;
- the destination: none, a memory address, or a register (together with the ROB entry number);
- the value and its presence/absence.

Reservation-station tags and register tags are now the IDs of entries in the ROB.
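For concreteness, one ROB entry might look like the record below. The field set follows the slide; the field and type names are mine.

```python
# One ROB entry, mirroring the fields on the slide (names are illustrative).
from dataclasses import dataclass
from typing import Optional


@dataclass
class ROBEntry:
    op_type: str                  # type of instruction: load, store, ALU, branch
    dest: Optional[str]           # destination: None, a memory address, or a register
    value: Optional[int] = None   # result value, once produced
    ready: bool = False           # presence/absence of the value (the "Done?" bit)
```

With this encoding, a waiting operand is tagged "ROB1", "ROB2", ... instead of with a functional-unit name, as in the worked example that follows.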

Tomasulo + ROB: worked example

The following snapshots (reconstructed from the slide figures) track the reorder buffer (ROB1 = oldest entry, ROB7 = newest), the destination registers, and the reservation stations as a short code sequence flows through the machine.

1. LD F0,10(R2) issues into ROB1 with destination F0, not done; load buffer entry 1 holds the address computation 10+R2.
2. ADDD F10,F4,F0 issues into ROB2 (destination F10). Its reservation-station entry reads "ADDD R(F4),ROB1": one operand comes from the register file, the other is tagged ROB1, since the load has not yet produced F0.
3. DIVD F2,F10,F6 issues into ROB3 (destination F2) as "DIVD ROB2,R(F6)", waiting on the ADDD via tag ROB2.
4. BNE F2,<...> issues into ROB4 (no register destination), and the machine speculates past it: LD F4,0(R3) enters ROB5 (load buffer entry 6 holds 0+R3) and ADDD F0,F4,F6 enters ROB6 as "ADDD ROB5,R(F6)".
5. ST 0(R3),F4 issues into ROB7. The second load completes: ROB5 holds M[10] and is marked done, so the store's value and the younger ADDD's waiting operand both become M[10]; the store is also marked done.
6. The younger ADDD executes and ROB6 holds its result <val2>. But nothing commits: the head of the ROB, LD F0,10(R2), is still not done, and commit proceeds strictly in order from the head, so the completed ST and ADDD must wait. If the BNE in ROB4 turns out to be mispredicted, entries ROB5-ROB7 are flushed.

Limits of ILP

Limits to ILP

Initial hardware model: MIPS compilers. Assumptions for an ideal/perfect machine to start:

1. Register renaming: infinite virtual registers, so all register WAW and WAR hazards are avoided.
2. Branch prediction: perfect; no mispredictions.
3. Jump prediction: all jumps perfectly predicted (returns, case statements).

Together, 2 and 3 mean no control dependences: perfect speculation and an unbounded buffer of available instructions.

Also: perfect caches; one-cycle latency for all instructions (including FP multiply and divide); unlimited instructions issued per clock cycle.

Window Impact on IPC

Figure (bar chart): instructions per clock for gcc, espresso, li, fpppp, doduc, and tomcatv as the instruction window shrinks from infinite to 2048, 512, 128, and 32 entries. IPC that ranges from about 18 (li) to about 150 (tomcatv) under an infinite window collapses to roughly 8-14 once the window is limited to 32 instructions.

Branch Impact

Figure (bar chart): instruction issues per cycle for gcc, espresso, li, fpppp, doducd, and tomcatv under five branch-prediction schemes: perfect, tournament (selective) predictor, standard 2-bit predictor with a 512-entry BHT, static prediction, and no prediction. FP programs sustain about 15-45 issues per cycle; integer programs about 6-12.

Beyond ILP

- Some applications (e.g., database or scientific codes) have much higher natural parallelism: explicit thread-level parallelism or data-level parallelism.
- Thread: an instruction stream with its own PC and data. A thread may be a process that is part of a parallel program of multiple processes, or it may be an independent program.
- Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute.
- Data-level parallelism: perform identical operations on data, and lots of data.

TLP: Thread-Level Parallelism

- ILP exploits implicit parallel operations within a loop or straight-line code segment.
- TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel.
- Goal: use multiple instruction streams to improve
  1. the throughput of computers that run many programs, and
  2. the execution time of multi-threaded programs.
- TLP can be more cost-effective to exploit than ILP.

Multiple Threads in Execution

- Multithreading: multiple threads share the functional units of one processor via overlapping.
- The processor must duplicate the independent state of each thread: e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table.
- Memory is shared through the virtual-memory mechanisms, which already support multiple processes.
- Hardware enables a fast thread switch, much faster than a full process switch (which costs 100s to 1000s of clock cycles).
- When to switch? Alternate instructions per thread (fine grain), or switch when a thread is stalled, perhaps on a cache miss (coarse grain). A sketch of the duplicated per-thread state follows.
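A hedged sketch of the per-thread state the slide lists; the record name and fields are illustrative, and the `stalled` flag is added only so the switch policies on the next slides have something to test.

```python
# Per-thread state a multithreaded core must duplicate (illustrative names).
from dataclasses import dataclass, field


@dataclass
class ThreadContext:
    pc: int                                   # separate program counter
    regfile: list = field(default_factory=lambda: [0] * 32)  # separate registers
    page_table_base: int = 0                  # separate page table for independent
                                              # programs; memory itself is shared
                                              # through virtual memory
    stalled: bool = False                     # e.g., waiting on a cache miss
```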

Fine-grained Multithreading

- Switches between threads on each instruction, so the execution of multiple threads is interleaved.
- Usually done in round-robin fashion, skipping any stalled threads; the CPU must be able to switch threads every clock (as sketched below).
- Advantage: it can hide both short and long stalls, since instructions from other threads execute when one thread stalls.
- Disadvantage: it slows down the execution of an individual thread, since a thread that is ready to execute without stalls is still delayed by instructions from other threads.
- Used on Sun's Niagara.
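A minimal sketch of the round-robin pick described above, reusing the hypothetical `ThreadContext` from the earlier sketch; the function name is mine.

```python
# Fine-grained policy: pick a new thread every cycle, round-robin,
# skipping stalled threads (illustrative sketch).
def pick_thread_fine_grained(threads, last):
    n = len(threads)
    for i in range(1, n + 1):         # start from the thread after `last`
        t = (last + i) % n
        if not threads[t].stalled:
            return t                  # a different thread may issue every clock
    return None                       # all threads stalled this cycle
```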

Coarse-grained Multithreading

- Switches threads only on costly stalls, such as L2 cache misses (as sketched below).
- Advantages: relieves the need for very fast thread switching, and doesn't slow down a thread, since instructions from other threads are issued only when the thread encounters a costly stall.
- Disadvantage: it is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs. Since the CPU issues instructions from one thread, when a stall occurs the pipeline must be emptied or frozen, and the new thread must fill the pipeline before its instructions can complete.
- Because of this start-up overhead, coarse-grained multithreading is better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time.
- Used in the IBM AS/400 and Alewife.
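The contrasting policy, as a sketch under the same assumptions; `PIPELINE_REFILL_CYCLES` is a made-up constant standing in for the start-up cost the slide describes.

```python
# Coarse-grained policy: stay on one thread until it takes a costly stall
# (e.g., an L2 miss), then pay a pipeline-refill penalty to switch.
PIPELINE_REFILL_CYCLES = 5            # assumption: roughly the pipeline depth


def pick_thread_coarse_grained(threads, current, l2_miss):
    if not l2_miss:
        return current, 0             # keep issuing from the same thread
    for t in range(len(threads)):     # costly stall: find another ready thread
        if t != current and not threads[t].stalled:
            # Start-up overhead: the pipeline must drain and refill, which is
            # why this policy only pays off when refill time << stall time.
            return t, PIPELINE_REFILL_CYCLES
    return current, 0
```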

TLP + ILP

- TLP and ILP exploit two different kinds of parallel structure in a program. Could a processor built to exploit ILP also exploit TLP?
- Functional units are often idle in a datapath designed for ILP, because of either stalls or dependences in the code.
- Could TLP be used as a source of independent instructions that keep the processor busy during stalls?
- Could TLP be used to employ the functional units that would otherwise lie idle when insufficient ILP exists?

SMT

Figure: issue-slot usage across cycles 1-9 for a machine with 8 units. With one thread, many slots go idle every cycle; with two threads, the second thread's instructions fill slots the first cannot use. (M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes)

SMT

- Simultaneous multithreading (SMT) builds on the insight that a dynamically scheduled processor already has many of the hardware mechanisms needed to support multithreading:
  - a large set of virtual registers that can hold the register sets of independent threads;
  - register renaming, which provides unique register identifiers, so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads;
  - out-of-order completion, which allows the threads to execute out of order and get better utilization of the hardware.
- Just add a per-thread renaming table and keep separate PCs.
- Independent commitment can be supported by logically keeping a separate reorder buffer for each thread.
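A sketch of what "simultaneous" means at the issue stage: one cycle's issue slots are filled from all ready threads, not just one. It assumes each thread exposes a hypothetical `ready_instructions` list; a real core arbitrates between threads more fairly than this simple priority order.

```python
# SMT issue sketch: fill up to ISSUE_WIDTH slots per cycle from *all* threads,
# so slots one thread cannot use are taken by another (illustrative names).
ISSUE_WIDTH = 8


def smt_issue(threads):
    slots = []
    for tid, t in enumerate(threads):
        while len(slots) < ISSUE_WIDTH and t.ready_instructions:
            instr = t.ready_instructions.pop(0)
            # Renaming has already given each thread's instructions unique
            # physical registers, so mixing threads cannot confuse
            # sources and destinations.
            slots.append((tid, instr))
    return slots
```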

All in One

Figure: issue slots over time (processor cycles) for superscalar, fine-grained, coarse-grained, multiprocessing, and simultaneous multithreading, with each slot either assigned to one of threads 1-5 or left idle. Only SMT fills slots from multiple threads within the same cycle.

IBM POWER4 and POWER5

POWER4 runs one thread per core; POWER5 adds 2-way SMT: 2 fetch units (one PC per thread), 2 initial decodes, and 2 commits (two architected register sets).

POWER5 data flow (block diagram).