Handout 2 ILP: Part B


Review from Last Time #1
- Leverage implicit parallelism for performance: instruction-level parallelism (ILP)
- Loop unrolling by the compiler to increase ILP
- Branch prediction to increase ILP
- Dynamic hardware exploiting ILP
  - Works when dependences can't be known at compile time
  - Can hide L1 cache misses
  - Code compiled for one machine runs well on another
2017/10/11

Review from Last Time #2
- Reservation stations: renaming to a larger set of registers + buffering of source operands
  - Prevents registers from becoming a bottleneck
  - Avoids WAR and WAW hazards
  - Allows loop unrolling in hardware
- Not limited to basic blocks (the integer unit gets ahead, beyond branches)
- Helps with cache misses as well
- Lasting contributions
  - Dynamic scheduling
  - Register renaming
  - Load/store disambiguation
- Descendants of the 360/91 include the Pentium 4, Power 5, and AMD Athlon/Opteron

Outline
- ILP
- Speculation
- Speculative Tomasulo Example
- Exceptions
- Issue Window vs. Reorder Buffer
- Simultaneous Multi-threading

Speculation to greater ILP
- Greater ILP: overcome control dependence by having the hardware speculate on the outcome of branches and execute the program as if the guesses were correct
  - Speculation: fetch, issue, and execute instructions as if branch predictions were always correct
  - Dynamic scheduling: deals with instruction scheduling
- Essentially a data-flow execution model: operations execute as soon as their operands are available

Speculation to greater ILP
3 components of HW-based speculation:
1. Dynamic branch prediction to choose which instructions to execute
2. Speculation to allow execution of instructions before control dependences are resolved + ability to undo the effects of an incorrectly speculated sequence
3. Dynamic scheduling to deal with the scheduling of different combinations of basic blocks

Adding Speculation to Tomasulo
- Must separate executing an instruction from allowing it to finish, or "commit"
- This additional step is called instruction commit
- When an instruction is no longer speculative, allow it to update the register file or memory
- Requires an additional set of buffers to hold the results of instructions that have finished execution but have not committed
- This reorder buffer (ROB) is also used to pass results among instructions that may be speculated

Reorder Buffer (ROB)
- In Tomasulo's algorithm, once an instruction writes its result, any subsequently issued instructions will find the result in the register file
- With speculation, the register file is not updated until the instruction commits (that is, until we know definitively that the instruction should execute)
- Thus, the ROB supplies operands in the interval between completion of instruction execution and instruction commit
  - The ROB is a source of operands for instructions, just as the reservation stations (RS) provide operands in Tomasulo's algorithm
  - The ROB extends the architected registers, as the reservation stations do
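The operand lookup described above can be sketched in a few lines of Python. This is only an illustration under assumed data structures: `reg_status` (a map from register name to the ROB entry that will write it), the dictionary-based `rob`, and the `("val", ...)` / `("tag", ...)` return convention are hypothetical names, not something the slides define.

```python
def read_operand(reg, reg_file, reg_status, rob):
    """Return ("val", value) if the operand is available, else ("tag", rob_number)."""
    tag = reg_status.get(reg)          # ROB entry that will write reg, if any
    if tag is None:
        return ("val", reg_file[reg])  # no pending writer: register file is current
    entry = rob[tag]
    if entry["ready"]:
        return ("val", entry["value"]) # executed but not yet committed: ROB supplies it
    return ("tag", tag)                # still executing: wait for this ROB number


reg_file = {"F0": 0.0, "F4": 2.0}
reg_status = {"F0": 1}                              # F0 will be written by ROB1
rob = {1: {"ready": False, "value": None}}

from_regs = read_operand("F4", reg_file, reg_status, rob)
waiting = read_operand("F0", reg_file, reg_status, rob)
rob[1].update(ready=True, value=1.5)                # ROB1 finishes execution
from_rob = read_operand("F0", reg_file, reg_status, rob)
```

Note that `reg_file["F0"]` still holds the old value after ROB1 finishes; only commit would update it.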

Reorder Buffer Entry
Each entry in the ROB contains four fields:
1. Instruction type: a branch (has no destination result), a store (has a memory address destination), or a register operation (ALU operation or load, which has register destinations)
2. Destination: register number (for loads and ALU operations) or memory address (for stores) where the instruction result should be written
3. Value: value of the instruction result until the instruction commits
4. Ready: indicates that the instruction has completed execution and the value is ready
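As a rough illustration, the four fields map naturally onto a small record type. The Python names below (`InstrType`, `ROBEntry`) and the loaded value 3.5 are made up for this sketch; real hardware stores these as bit fields.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class InstrType(Enum):
    BRANCH = "branch"      # no destination result
    STORE = "store"        # destination is a memory address
    REGISTER = "register"  # ALU operation or load; destination is a register


@dataclass
class ROBEntry:
    instr_type: InstrType
    destination: Optional[Union[str, int]] = None  # register name or memory address
    value: Optional[float] = None                  # result held until commit
    ready: bool = False                            # True once execution has completed


# LD F0,10(R2) from the example slides, occupying one ROB slot.
load_entry = ROBEntry(InstrType.REGISTER, destination="F0")
load_entry.value, load_entry.ready = 3.5, True     # 3.5 is an invented loaded value
branch_entry = ROBEntry(InstrType.BRANCH)          # branches carry no destination
```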

Reorder Buffer operation
- Holds instructions in FIFO order, exactly as issued
- When instructions complete, their results are placed into the ROB
  - Supplies operands to other instructions between execution complete and commit: more registers, like the RS
  - Results are tagged with the ROB buffer number instead of the reservation station number
- Instructions commit: values at the head of the ROB are placed in the registers
- As a result, it is easy to undo speculated instructions on mispredicted branches or on exceptions
[Figure labels: Commit path, FP Op Queue, Reorder Buffer, FP Regs, Res Stations, FP Adder]
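The FIFO behavior above can be sketched as a small queue. This is a simplified model, not a hardware description: the class and method names are hypothetical, and tags simply count upward rather than wrapping around a fixed set of ROB slots as real hardware would.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()   # left end = oldest (head), right end = newest
        self.next_tag = 1        # assumption: tags grow monotonically

    def issue(self, dest):
        """Allocate an entry in program order; return its tag, or None if full."""
        if len(self.entries) >= self.size:
            return None
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append({"tag": tag, "dest": dest, "value": None, "ready": False})
        return tag

    def write_result(self, tag, value):
        """Record a completed result; completion may happen out of order."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"], entry["ready"] = value, True

    def commit(self, regs):
        """Retire ready entries from the head only, updating the registers."""
        committed = []
        while self.entries and self.entries[0]["ready"]:
            entry = self.entries.popleft()
            if entry["dest"] is not None:
                regs[entry["dest"]] = entry["value"]
            committed.append(entry["tag"])
        return committed

    def flush_after(self, tag):
        """Undo speculation: drop every entry younger than the given tag."""
        while self.entries and self.entries[-1]["tag"] > tag:
            self.entries.pop()


rob, regs = ReorderBuffer(size=4), {}
t1, t2 = rob.issue("F0"), rob.issue("F10")
rob.write_result(t2, 2.0)            # the younger instruction finishes first
early = rob.commit(regs)             # nothing commits: the head is not ready
rob.write_result(t1, 1.0)
late = rob.commit(regs)              # both now commit, in program order
```

Because registers change only in `commit`, discarding entries with `flush_after` leaves no architectural side effects to repair.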

Recall: 4 Steps of Speculative Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number for the destination (this stage is sometimes called "dispatch")
2. Execution: operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute; checks RAW (sometimes called "issue")
3. Write result: finish execution (WB)
   Write on the Common Data Bus to all awaiting FUs and the reorder buffer; mark the reservation station available.
4. Commit: update register with reorder result
   When the instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer. A mispredicted branch flushes the reorder buffer (sometimes called "graduation")
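Steps 2 and 3 can be illustrated with a toy reservation station that waits on ROB numbers and captures values broadcast on the CDB. This is a sketch under assumptions: the class, the `("val", ...)` / `("tag", ...)` operand encoding, and the operand values 2.0, 1.5, and 7.0 are invented; the two instructions mirror ADDD F10,F4,F0 and DIVD F2,F10,F6 from the example slides.

```python
import operator

OPS = {"ADDD": operator.add, "DIVD": operator.truediv}


class ReservationStation:
    def __init__(self, op, src1, src2, dest_tag):
        # Each source is ("val", number) or ("tag", rob_number).
        self.op = op
        self.src = [src1, src2]
        self.dest_tag = dest_tag

    def snoop(self, tag, value):
        """Watch the CDB: replace a matching tag with the broadcast value."""
        self.src = [("val", value) if s == ("tag", tag) else s for s in self.src]

    def ready(self):
        return all(kind == "val" for kind, _ in self.src)

    def execute(self):
        (_, a), (_, b) = self.src
        return OPS[self.op](a, b)


# ADDD waits on ROB1 (the load); DIVD waits on ROB2 (the ADDD).
add = ReservationStation("ADDD", ("val", 2.0), ("tag", 1), dest_tag=2)
div = ReservationStation("DIVD", ("tag", 2), ("val", 7.0), dest_tag=3)
stations = [add, div]


def broadcast(tag, value):
    for rs in stations:
        rs.snoop(tag, value)


ready_before = add.ready()             # False: the RAW hazard on F0 is unresolved
broadcast(1, 1.5)                      # the load writes its result on the CDB
add_result = add.execute()             # 2.0 + 1.5
broadcast(add.dest_tag, add_result)    # result is tagged with the ROB number
div_result = div.execute()             # 3.5 / 7.0
```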

Tomasulo With Reorder Buffer:
Reorder buffer (ROB1 = oldest, ROB7 = newest):
  Entry  Dest  Value  Instruction       Done?
  ROB1   F0           LD F0,10(R2)      N
Load buffer (to/from memory): 1  10+R2
[Other figure blocks: FP Op Queue, Registers, Reservation Stations, FP adders, FP multipliers]

Tomasulo With Reorder Buffer:
Reorder buffer (ROB1 = oldest, ROB7 = newest):
  Entry  Dest  Value  Instruction       Done?
  ROB2   F10          ADDD F10,F4,F0    N
  ROB1   F0           LD F0,10(R2)      N
Reservation stations (FP adders): 2  ADDD R(F4),ROB1
Load buffer (to/from memory): 1  10+R2
[Other figure blocks: FP Op Queue, Registers, FP multipliers]

Tomasulo With Reorder Buffer:
Reorder buffer (ROB1 = oldest, ROB7 = newest):
  Entry  Dest  Value  Instruction       Done?
  ROB3   F2           DIVD F2,F10,F6    N
  ROB2   F10          ADDD F10,F4,F0    N
  ROB1   F0           LD F0,10(R2)      N
Reservation stations (FP adders): 2  ADDD R(F4),ROB1
Reservation stations (FP multipliers): 3  DIVD ROB2,R(F6)
Load buffer (to/from memory): 1  10+R2
[Other figure blocks: FP Op Queue, Registers]

Tomasulo With Reorder Buffer:
Reorder buffer (ROB1 = oldest, ROB7 = newest):
  Entry  Dest  Value  Instruction       Done?
  ROB6   F0           ADDD F0,F4,F6     N
  ROB5   F4           LD F4,0(R3)       N
  ROB4   --           BNE F2,<...>      N
  ROB3   F2           DIVD F2,F10,F6    N
  ROB2   F10          ADDD F10,F4,F0    N
  ROB1   F0           LD F0,10(R2)      N
Reservation stations (FP adders): 2  ADDD R(F4),ROB1; 6  ADDD ROB5,R(F6)
Reservation stations (FP multipliers): 3  DIVD ROB2,R(F6)
Load buffer (to/from memory): 1  10+R2; 5  0+R3
[Other figure blocks: FP Op Queue, Registers]

Tomasulo With Reorder Buffer:
Reorder buffer (ROB1 = oldest, ROB7 = newest):
  Entry  Dest  Value  Instruction       Done?
  ROB7   --    ROB5   ST 0(R3),F4       N
  ROB6   F0           ADDD F0,F4,F6     N
  ROB5   F4           LD F4,0(R3)       N
  ROB4   --           BNE F2,<...>      N
  ROB3   F2           DIVD F2,F10,F6    N
  ROB2   F10          ADDD F10,F4,F0    N
  ROB1   F0           LD F0,10(R2)      N
Reservation stations (FP adders): 2  ADDD R(F4),ROB1; 6  ADDD ROB5,R(F6)
Reservation stations (FP multipliers): 3  DIVD ROB2,R(F6)
Load buffer (to/from memory): 1  10+R2; 5  0+R3
[Other figure blocks: FP Op Queue, Registers]

Tomasulo With Reorder Buffer:
Reorder buffer (ROB1 = oldest, ROB7 = newest):
  Entry  Dest  Value  Instruction       Done?
  ROB7   --    M[10]  ST 0(R3),F4       Y
  ROB6   F0           ADDD F0,F4,F6     N
  ROB5   F4    M[10]  LD F4,0(R3)       Y
  ROB4   --           BNE F2,<...>      N
  ROB3   F2           DIVD F2,F10,F6    N
  ROB2   F10          ADDD F10,F4,F0    N
  ROB1   F0           LD F0,10(R2)      N
Reservation stations (FP adders): 2  ADDD R(F4),ROB1; 6  ADDD M[10],R(F6)
Reservation stations (FP multipliers): 3  DIVD ROB2,R(F6)
Load buffer (to/from memory): 1  10+R2
[Other figure blocks: FP Op Queue, Registers]

Tomasulo With Reorder Buffer:
Reorder buffer (ROB1 = oldest, ROB7 = newest):
  Entry  Dest  Value   Instruction       Done?
  ROB7   --    M[10]   ST 0(R3),F4       Y
  ROB6   F0    <val2>  ADDD F0,F4,F6     Ex
  ROB5   F4    M[10]   LD F4,0(R3)       Y
  ROB4   --            BNE F2,<...>      N
  ROB3   F2            DIVD F2,F10,F6    N
  ROB2   F10           ADDD F10,F4,F0    N
  ROB1   F0            LD F0,10(R2)      N
Reservation stations (FP adders): 2  ADDD R(F4),ROB1
Reservation stations (FP multipliers): 3  DIVD ROB2,R(F6)
Load buffer (to/from memory): 1  10+R2
[Other figure blocks: FP Op Queue, Registers]

Avoiding Memory Hazards
- WAW and WAR hazards through memory are eliminated with speculation because the actual updating of memory occurs in order, when a store is at the head of the ROB, and hence no earlier loads or stores can still be pending
- RAW hazards through memory are maintained by two restrictions:
  1. not allowing a load to initiate the second step of its execution if any active ROB entry occupied by a store has a Destination field that matches the value of the A field of the load, and
  2. maintaining the program order for the computation of an effective address of a load with respect to all earlier stores.
- These restrictions ensure that any load that accesses a memory location written to by an earlier store cannot perform the memory access until the store has written the data
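A minimal sketch of the first restriction, assuming the ROB is modeled as a list of dictionaries in program order (oldest first) with hypothetical `op` and `addr` keys. Treating a store whose address is not yet known as a conflict is a conservative choice made for this sketch; under the second restriction, earlier store addresses would already have been computed.

```python
def load_may_access_memory(rob, load_index):
    """True if the load at rob[load_index] may start its memory access."""
    load = rob[load_index]
    for earlier in rob[:load_index]:
        if earlier["op"] != "ST":
            continue
        # An uncommitted earlier store to the same (or an unknown) address blocks the load.
        if earlier["addr"] is None or earlier["addr"] == load["addr"]:
            return False
    return True


rob = [
    {"op": "ST", "addr": 100},   # oldest: store still waiting to commit
    {"op": "LD", "addr": 100},   # reads the location the store will write
    {"op": "LD", "addr": 200},   # unrelated address
]
blocked = load_may_access_memory(rob, 1)
free_to_go = load_may_access_memory(rob, 2)

rob.pop(0)                                     # the store commits and leaves the ROB
after_commit = load_may_access_memory(rob, 0)  # the first load is now at index 0
```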

Exceptions and Interrupts
- IBM 360/91 invented "imprecise interrupts"
  - "Computer stopped at this PC; it's likely close to this address"
  - Not so popular with programmers
  - Also, what about virtual memory? (Not in IBM 360)
- Technique for both precise interrupts/exceptions and speculation: in-order completion and in-order commit
  - If we speculate and are wrong, we need to back up and restart execution at the point where we predicted incorrectly
  - This is exactly the same as what we need to do for precise exceptions
- Exceptions are handled by not recognizing the exception until the instruction that caused it is ready to commit in the ROB
  - If a speculated instruction raises an exception, the exception is recorded in the ROB
  - This is why reorder buffers are in all new processors
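The idea of deferring an exception until commit can be sketched as follows. The list-of-dictionaries ROB, the `exception` and `pc` keys, and the function name are assumptions for illustration only; what an actual handler does after the fault is outside the sketch.

```python
def commit_in_order(rob, regs):
    """Commit from the head of rob (oldest first).

    Returns the PC of a faulting instruction, or None if no exception was reached.
    """
    while rob and rob[0]["ready"]:
        head = rob.pop(0)
        if head["exception"] is not None:
            rob.clear()              # younger instructions never update state
            return head["pc"]        # the single precise restart point
        if head["dest"] is not None:
            regs[head["dest"]] = head["value"]
    return None


rob = [
    {"pc": 0x100, "dest": "F0", "value": 1.0, "ready": True, "exception": None},
    {"pc": 0x104, "dest": "F4", "value": None, "ready": True, "exception": "page fault"},
    {"pc": 0x108, "dest": "F2", "value": 9.0, "ready": True, "exception": None},
]
regs = {}
fault_pc = commit_in_order(rob, regs)
```

Here the instruction at 0x108 had already finished executing, yet its result never reaches `regs`: everything before the faulting instruction has completed and nothing at or after it has.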

Tomasulo With Reorder Buffer: What about memory hazards?
[Figure: reorder buffer holding, from newest to oldest, ST 0(R3),F4 (ROB7, value M[10], done), ADDD F0,F4,F6 (ROB6, <val2>, executing), LD F4,0(R3) (ROB5, value M[10], done), BNE F2,<...> (ROB4), DIVD F2,F10,F6 (ROB3), ADDD F10,F4,F0 (ROB2), LD F0,10(R2) (ROB1); reservation stations hold 2 ADDD R(F4),ROB1 and 3 DIVD ROB2,R(F6); load buffer holds 1 10+R2]

Speculation: Register Renaming vs. ROB
- An alternative to the ROB is a larger physical set of registers combined with register renaming
  - Extended registers replace the function of both the ROB and the reservation stations
- Instruction issue maps the names of architectural registers to physical register numbers in the extended register set
  - On issue, allocates a new unused register for the destination (which avoids WAW and WAR hazards)
  - Speculation recovery is easy because a physical register holding an instruction destination does not become the architectural register until the instruction commits
- Most out-of-order processors today use extended registers with renaming
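A minimal renaming sketch, assuming a map table plus a free list. The class name `Renamer`, the register counts, and the choice to return the previous mapping are illustrative assumptions; the three instructions are taken from the earlier Tomasulo example.

```python
class Renamer:
    def __init__(self, arch_regs, num_physical):
        # Initially architectural register i lives in physical register i.
        self.map = {r: i for i, r in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_physical))

    def rename(self, dest, srcs):
        """Return (new physical dest, physical sources, previous mapping of dest)."""
        phys_srcs = [self.map[s] for s in srcs]   # read sources before remapping dest
        if not self.free:
            raise RuntimeError("no free physical register: issue must stall")
        new = self.free.pop(0)
        old = self.map[dest]                      # could be freed once this commits
        self.map[dest] = new
        return new, phys_srcs, old


r = Renamer(["F0", "F2", "F4", "F6", "F10"], num_physical=8)
ld = r.rename("F0", [])               # LD   F0,10(R2)
add1 = r.rename("F10", ["F4", "F0"])  # ADDD F10,F4,F0 reads the load's new register
add2 = r.rename("F0", ["F4", "F6"])   # ADDD F0,F4,F6 gets a different register for F0
```

The two writers of F0 receive different physical registers, so the WAW hazard between them disappears, and the first ADDD still reads the register the load was given.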

Reservation Station and ROB: Intel Nehalem Microarchitecture
- Fetch/decode 4 instructions
- Dispatch 6 uops
- 128 reorder buffer entries

Issue Window-Based OOO Processor Architecture
[Figure: example ARM instructions (add, ldr) flowing through the stages Instruction Fetch and Decode, Register Rename, and Issue window into four Execution units]

Issue window based: Cortex A9 Microarchitecture (single core variant)
- Fetch and decode 2 instructions
- Dispatch 4 instructions per clock cycle

Performance beyond single thread ILP
- There can be much higher natural parallelism in some applications (e.g., database or scientific codes)
- Explicit Thread Level Parallelism or Data Level Parallelism
- Thread: a process with its own instructions and data
  - A thread may be one process of a parallel program made up of multiple processes, or it may be an independent program
  - Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
- Data Level Parallelism: perform identical operations on data, and lots of data

Thread Level Parallelism (TLP)
- ILP exploits implicit parallel operations within a loop or straight-line code segment
- TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel
- Goal: use multiple instruction streams to improve
  1. Throughput of computers that run many programs
  2. Execution time of multi-threaded programs
- TLP could be more cost-effective to exploit than ILP

Do both ILP and TLP?
- TLP and ILP exploit two different kinds of parallel structure in a program
- Could a processor oriented toward ILP also exploit TLP?
  - Functional units are often idle in a data path designed for ILP because of either stalls or dependences in the code
- Could TLP be used as a source of independent instructions that might keep the processor busy during stalls?
- Could TLP be used to employ the functional units that would otherwise lie idle when insufficient ILP exists?

Simultaneous Multi-threading
[Figure: issue-slot occupancy over cycles 1 to 9 for eight functional units (M, M, FX, FX, FP, FP, BR, CC), shown for "One thread, 8 units" and "Two threads, 8 units"]
M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes

Simultaneous Multithreading (SMT)
- Simultaneous multithreading (SMT): the insight that a dynamically scheduled processor already has many of the HW mechanisms needed to support multithreading
  - A large set of virtual registers that can be used to hold the register sets of independent threads
  - Register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads
  - Out-of-order completion allows the threads to execute out of order and get better utilization of the HW
- Just add a per-thread renaming table and keep separate PCs
  - Independent commitment can be supported by logically keeping a separate reorder buffer for each thread
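The per-thread renaming table can be sketched as one map per thread drawing from a single shared pool of physical registers. The class `SMTRenamer`, its method names, and the register counts are hypothetical, chosen only to show that the same architectural name in two threads never collides.

```python
class SMTRenamer:
    def __init__(self, num_threads, arch_regs, num_physical):
        self.free = list(range(num_physical))    # one pool shared by all threads
        # Each thread gets its own table mapping architectural -> physical registers.
        self.tables = [
            {reg: self.free.pop(0) for reg in arch_regs}
            for _ in range(num_threads)
        ]

    def rename_dest(self, thread, dest):
        """Give this thread's destination a fresh physical register."""
        new = self.free.pop(0)
        self.tables[thread][dest] = new
        return new

    def lookup(self, thread, src):
        return self.tables[thread][src]


smt = SMTRenamer(num_threads=2, arch_regs=["F0", "F2"], num_physical=8)
p_t0 = smt.rename_dest(0, "F0")   # thread 0 writes F0
p_t1 = smt.rename_dest(1, "F0")   # thread 1 writes its own F0
```

After renaming, the datapath sees only physical register numbers, so instructions from both threads can share the reservation stations and functional units.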

Multithreaded Categories
[Figure: issue slots over time (processor cycles) for five organizations: Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot]

Power 4
Single-threaded predecessor to Power 5. Its out-of-order engine has 8 execution units, each of which may issue an instruction every cycle.

Power 4 vs. Power 5
[Figure: pipeline diagrams of Power 4 and Power 5, including the instruction cache. Power 5 annotations: 2 fetch (PC), 2 initial decodes, 2 commits (architected register sets)]

Power 5 data flow
Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to becoming a bottleneck.

Design Challenges in SMT
- Since SMT makes sense only with a fine-grained implementation, what is the impact of fine-grained scheduling on single-thread performance?
  - Does a preferred-thread approach sacrifice neither throughput nor single-thread performance?
  - Unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput when the preferred thread stalls
- A larger register file is needed to hold multiple contexts
- Not affecting clock cycle time, especially in
  - Instruction issue: more candidate instructions need to be considered
  - Instruction completion: choosing which instructions to commit may be challenging
- Ensuring that cache and TLB conflicts generated by SMT do not degrade performance

In Conclusion
- Interrupts and exceptions either interrupt the current instruction or happen between instructions
  - Possibly large quantities of state must be saved before interrupting
- Machines with precise exceptions provide one single point in the program at which to restart execution
  - All instructions before that point have completed
  - No instructions after or including that point have completed
- Hardware techniques exist for precise exceptions even in the face of out-of-order execution!
  - An important enabling factor for out-of-order execution