Hardware-Based Speculation


Hardware-Based Speculation

Execute instructions along predicted execution paths, but only commit the results if the prediction was correct.
- Instruction commit: allowing an instruction to update the register file once the instruction is no longer speculative.
- An additional piece of hardware is needed to prevent any irrevocable action until an instruction commits.
- We must separate executing an instruction (to pass data to other instructions) from completing it (performing operations that cannot be undone).

Reorder Buffer

The reorder buffer (ROB) holds the result of an instruction between completion and commit, and supplies it to any instruction that needs it, just like the reservation stations in Tomasulo's algorithm. Each entry has four fields:
- Instruction type: branch / store / register operation
- Destination field: register number or memory address
- Value field: output value
- Ready field: has the instruction completed execution?
The reservation stations are modified so that the operand source is now the reorder buffer instead of the functional unit (results are tagged with the ROB entry number).

Reorder Buffer (cont.)
- Register values and memory values are not written until an instruction commits.
- On a misprediction, the speculated entries in the ROB are cleared.
- Exceptions are not recognized until the instruction is ready to commit.

Instructions pass through four stages: Issue, Execute, Write Result, and Commit.

Issue
- If there is an empty reservation station and a free ROB entry, issue; otherwise stall.
- Send operands to the reservation station if they are available in the registers or in the ROB.
- The number of the ROB entry allocated to the instruction is sent to the reservation station, to tag the result with.
- If an operand is not available yet, the ROB entry number of its producer is sent to the reservation station, which waits for the result on the CDB.

Execute
- If one or more operands are not yet available, monitor the CDB; when the result is broadcast on the CDB (identified by its ROB entry tag), copy it.
- When all operands are ready, start execution.

Write Result
- When execution completes, broadcast the result on the CDB, tagged with the ROB entry number.
- The result is copied into the ROB entry and into all waiting reservation stations.
Instructions execute out of order, but commit in order.

Commit
When an instruction reaches the head of the ROB and its result is ready in the buffer:
- If it is an ALU operation, write the result to the register file and remove the instruction from the ROB.
- If it is a store, write the value to memory and remove the instruction from the ROB.
- If it is a branch and the prediction was correct, remove it from the ROB; on a misprediction, flush the ROB and restart from the correct successor.
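The commit rules above can be sketched in a few lines. This is a minimal illustration (the class and method names are hypothetical, not from the lecture): entries may become ready out of order, but only the head of the buffer is allowed to update architectural state, and a mispredicted branch flushes every younger entry.

```python
# Minimal reorder-buffer sketch: out-of-order completion, in-order commit.
from collections import deque

class ROBEntry:
    def __init__(self, kind, dest):
        self.kind = kind          # 'alu', 'store', or 'branch'
        self.dest = dest          # register name or memory address
        self.value = None         # result, filled in at Write Result
        self.ready = False        # completed execution?

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()

    def issue(self, kind, dest):
        e = ROBEntry(kind, dest)
        self.entries.append(e)    # allocate at the tail, in program order
        return e                  # results are tagged with this entry

    def write_result(self, entry, value, mispredicted=False):
        entry.value, entry.ready = value, True
        entry.mispredicted = mispredicted

    def commit(self, regs, mem):
        """Retire ready instructions from the head, in order."""
        while self.entries and self.entries[0].ready:
            e = self.entries.popleft()
            if e.kind == 'alu':
                regs[e.dest] = e.value     # architectural state updated here
            elif e.kind == 'store':
                mem[e.dest] = e.value      # irrevocable only at commit
            elif e.kind == 'branch' and getattr(e, 'mispredicted', False):
                self.entries.clear()       # flush speculated entries
                return

regs, mem = {}, {}
rob = ReorderBuffer()
a = rob.issue('alu', 'F4')
s = rob.issue('store', 0x100)
rob.write_result(s, 3.5)       # store finishes first (out of order)...
rob.commit(regs, mem)          # ...but cannot commit past the unfinished ALU op
assert mem == {}
rob.write_result(a, 1.5)
rob.commit(regs, mem)          # now both retire, in program order
print(regs, mem)               # → {'F4': 1.5} {256: 3.5}
```

Note that the store writes memory only at commit; this is exactly the property that makes speculation safe to undo.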

Multiple Issue and Static Scheduling

To achieve CPI < 1, the processor must complete multiple instructions per clock. Solutions:
- Statically scheduled superscalar processors
- VLIW (very long instruction word) processors
- Dynamically scheduled superscalar processors

VLIW Processors

Package multiple operations into one instruction. An example VLIW processor issues, per instruction:
- One integer operation (or branch)
- Two independent floating-point operations
- Two independent memory references
There must be enough parallelism in the code to fill the available slots.

Disadvantages:
- Finding parallelism statically
- Code size
- No hazard detection hardware
- Binary code compatibility

VLIW Example

Latencies between dependent instructions:

  Source instruction   Instruction using result   Latency (cycles)
  FP ALU op            FP ALU op                  3
  FP ALU op            Store double               2
  Load double          FP ALU op                  1
  Load double          Store double               0

The loop for (i = 1000; i > 0; i--) x[i] = x[i] + s; compiles to:

  Loop: L.D    F0,0(R1)
        ADD.D  F4,F0,F2
        S.D    0(R1),F4
        DADDUI R1,R1,#-8
        BNE    R1,R2,Loop

Assume we can schedule 2 memory operations, 2 FP operations, and one integer operation or branch per instruction. Unrolling seven iterations gives the following schedule:

  Memory ref 1      Memory ref 2      FP op 1           FP op 2           Int op/branch       Clock
  L.D F0,0(R1)      L.D F6,-8(R1)                                                             1
  L.D F10,-16(R1)   L.D F14,-24(R1)                                                           2
  L.D F18,-32(R1)   L.D F22,-40(R1)   ADD.D F4,F0,F2    ADD.D F8,F6,F2                        3
  L.D F26,-48(R1)                     ADD.D F12,F10,F2  ADD.D F16,F14,F2                      4
                                      ADD.D F20,F18,F2  ADD.D F24,F22,F2                      5
  S.D 0(R1),F4      S.D -8(R1),F8     ADD.D F28,F26,F2                                        6
  S.D -16(R1),F12   S.D -24(R1),F16                                       DADDUI R1,R1,#-56   7
  S.D 24(R1),F20    S.D 16(R1),F24                                                            8
  S.D 8(R1),F28                                                           BNE R1,R2,Loop      9
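The slot constraints in this example can be sketched as a tiny greedy packer. This is a hedged illustration, not the compiler's actual algorithm: the helper names and the op list are invented, the ops are assumed already in a legal order, and a real scheduler would also have to honor the latency table above when placing dependent operations.

```python
# Greedy sketch: pack independent ops into VLIW bundles with per-class slot
# limits (2 memory, 2 FP, 1 integer/branch), closing a bundle when a class
# runs out of free slots.
from collections import Counter

SLOT_LIMITS = {"mem": 2, "fp": 2, "int": 1}   # slots per bundle

def pack_bundles(ops):
    """ops: list of (name, slot_class) in a legal issue order.
    Returns a list of bundles, each a list of op names."""
    bundles, current, used = [], [], Counter()
    for name, cls in ops:
        if used[cls] == SLOT_LIMITS[cls]:     # no free slot of this class:
            bundles.append(current)           # close the bundle, start a new one
            current, used = [], Counter()
        current.append(name)
        used[cls] += 1
    if current:
        bundles.append(current)
    return bundles

ops = [
    ("L.D F0,0(R1)", "mem"), ("L.D F6,-8(R1)", "mem"),
    ("L.D F10,-16(R1)", "mem"),
    ("ADD.D F4,F0,F2", "fp"), ("ADD.D F8,F6,F2", "fp"),
    ("DADDUI R1,R1,#-24", "int"),
]
for i, bundle in enumerate(pack_bundles(ops), 1):
    print(i, bundle)
```

The third load forces a second bundle because only two memory slots exist per instruction; empty slots in a bundle are exactly the source of the code-size disadvantage noted above.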

Dynamic Scheduling, Multiple Issue, and Speculation

Modern microarchitectures combine dynamic scheduling, multiple issue, and speculation. Two approaches to issuing multiple instructions per clock:
- Assign reservation stations and update the pipeline control table in half clock cycles; this only supports 2 instructions/clock.
- Design logic to handle any possible dependencies between the instructions issued together.
Hybrid approaches also exist. Either way, the issue logic can become the bottleneck.

[Figure: overview of the design.]

Multiple Issue
- Limit the number of instructions of a given class that can be issued in a bundle, e.g. one FP, one integer, one load, one store.
- Examine all the dependencies among the instructions in the bundle.
- If dependencies exist within the bundle, encode them in the reservation stations.
- Multiple completion/commit is also needed.

Example

  Loop: LD     R2,0(R1)    ; R2 = array element
        DADDIU R2,R2,#1    ; increment R2
        SD     R2,0(R1)    ; store result
        DADDIU R1,R1,#8    ; increment pointer
        BNE    R2,R3,Loop  ; branch if not last element

[Figures: issue/execute/write/commit timing of the example loop without speculation, and the same timing with speculation.]

Thread-Level Parallelism

- ILP exploits parallelism within straight-line code or loops.
- A cache miss to an off-chip cache or to main memory is unlikely to be hidden using ILP; thread-level parallelism is used instead.
- A thread is a process with its own instructions and data. It may be part of a parallel program of multiple processes, or it may be an independent program.
- Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute.

From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism", ISCA 1995.

Multithreading

Multithreading lets multiple threads share the functional units of one processor via overlapping:
- The processor must duplicate the independent state of each thread: e.g. a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table.
- Memory is shared through the virtual-memory mechanisms, which already support multiple processes.
- Hardware supports fast thread switching, much faster than a full process switch, which takes 100s to 1000s of clock cycles.
When to switch?
- Alternate instructions per thread (fine-grained).
- When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse-grained).

Fine-Grained Multithreading
- Switches between threads on each instruction, so the execution of multiple threads is interleaved.
- Usually done in round-robin fashion, skipping any stalled threads.
- The CPU must be able to switch threads every clock.
- Advantage: it can hide both short and long stalls, since instructions from other threads execute when one thread stalls.
- Disadvantage: it slows down the execution of an individual thread, since a thread ready to execute without stalls is delayed by instructions from other threads.
- Used on Sun's T1.
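The round-robin-with-skipping policy above can be sketched directly. This is a toy model (the function and its stall predicate are invented for illustration): each cycle, one instruction issues from the next ready thread in round-robin order.

```python
# Fine-grained multithreading sketch: round-robin issue, skipping stalled threads.
def fine_grained_schedule(stalled, n_threads, n_cycles):
    """stalled: function (thread, cycle) -> True if that thread is stalled.
    Returns, per cycle, the thread issued from (None if all are stalled)."""
    issued = []
    t = 0                                     # next thread in round-robin order
    for cycle in range(n_cycles):
        chosen = None
        for k in range(n_threads):            # probe threads starting from t
            cand = (t + k) % n_threads
            if not stalled(cand, cycle):
                chosen = cand
                t = (cand + 1) % n_threads    # resume after the chosen thread
                break
        issued.append(chosen)                 # None: every thread was stalled
    return issued

# Thread 1 stalls on even cycles; threads 0 and 2 are always ready.
print(fine_grained_schedule(lambda th, c: th == 1 and c % 2 == 0, 3, 6))
# → [0, 1, 2, 0, 2, 0]
```

The output shows both properties from the slide: the pipeline never idles while some thread is ready, but thread 0's instructions are spread out even though it never stalls.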

Coarse-Grained Multithreading

Switches threads only on costly stalls, such as L2 cache misses.
Advantages:
- No need for very fast thread switching.
- Doesn't slow down the individual thread, since instructions from other threads are issued only when the thread encounters a costly stall.
Disadvantage: it is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs:
- Since the CPU issues instructions from one thread, when a stall occurs the pipeline must be emptied or frozen.
- The new thread must fill the pipeline before its instructions can complete.
- Because of this start-up overhead, coarse-grained multithreading is better at reducing the penalty of high-cost stalls, where pipeline refill time << stall time.

Simultaneous Multithreading (SMT)

Fine-grained multithreading implemented on top of a multiple-issue, dynamically scheduled processor: instructions from different threads issue in the same cycle.
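The coarse-grained trade-off above (pipeline refill vs. stall length) reduces to a one-line cost model. The cycle counts here are assumed for illustration, not measurements from any real machine:

```python
# Cost model: switching threads on a stall costs the pipeline refill time;
# staying on the stalled thread costs the full stall.
def throughput_loss(stall, refill, switch):
    """Cycles lost on one stall event."""
    return refill if switch else stall

l2_miss, l1_miss, refill = 110, 5, 15         # assumed cycle counts
print(throughput_loss(l2_miss, refill, switch=True))    # → 15 (vs. 110: big win)
print(throughput_loss(l1_miss, refill, switch=True))    # → 15 (worse than stalling)
print(throughput_loss(l1_miss, refill, switch=False))   # → 5  (better to just stall)
```

With these numbers the refill (15 cycles) exceeds a short stall (5 cycles), so switching on short stalls loses throughput; that is exactly why coarse-grained multithreading reserves switching for costly stalls.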

SMT

[Figure: issue-slot utilization over time for one thread vs. two threads on a machine with 8 functional units.]

[Figure: issue slots over time (processor cycles) for superscalar, fine-grained multithreading, coarse-grained multithreading, multiprocessing, and simultaneous multithreading; shading distinguishes threads 1-5 from idle slots.]

Sun T1
- Focused on TLP rather than ILP.
- Fine-grained multithreading.
- 8 cores, 4 threads per core, one shared FP unit.
- 6-stage pipeline (similar to MIPS, with one stage for thread switching).
- L1 caches: 16KB instruction, 8KB data, 64-byte block size; misses to L2 cost 23 cycles with no contention.
- L2 caches: 4 separate L2 caches, each 750KB; misses to main memory cost 110 cycles assuming no contention.

[Figure: relative change in miss rates and miss latencies (L1 I, L1 D, L2) when executing one thread per core vs. 4 threads per core, on TPC-C.]
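The penalties quoted above (23 cycles to L2, 110 cycles to memory) make it easy to see why a single thread cannot hide memory latency and fine-grained multithreading pays off. A hedged back-of-the-envelope, with miss rates assumed purely for illustration (they are not T1 measurements):

```python
# Average memory-stall cycles per access, using the T1 penalties quoted above
# and assumed local miss rates.
def avg_miss_cycles(l1_miss_rate, l2_local_miss_rate,
                    l1_penalty=23, l2_penalty=110):
    """Expected stall cycles per memory access."""
    return l1_miss_rate * (l1_penalty + l2_local_miss_rate * l2_penalty)

# Assumed: 10% L1 misses, of which 20% also miss in L2.
print(avg_miss_cycles(0.10, 0.20))   # → 4.5 cycles per access
```

Several stall cycles per memory access would dominate a short 6-stage pipeline with little ILP to exploit, which is why T1 instead keeps 4 threads per core ready to issue.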

Sun T1 (cont.)

Breakdown of the status of an average thread: "executing", "ready" (the thread is ready, but another one is chosen), and "not ready". The core stalls only if all 4 threads are not ready.

[Figure: executing/ready/not-ready breakdown for TPC-C, SPECJBB00, and SPECWeb99.]

Breakdown of the causes of a thread being not ready: L1 I miss, L1 D miss, L2 miss, pipeline delay, and other.

[Figure: not-ready cause breakdown for TPC-C, SPECJBB00, and SPECWeb99.]