
Announcements
- UG4 Honours project selection: talk to Vijay or Boris if interested in computer architecture projects
Inf3 Computer Architecture - 2017-2018 1

Last time: Tomasulo's Algorithm

Last time: Summary of Tomasulo's Advantages
- Register renaming: no need to wait on WAR and WAW hazards (not true dependencies)
- Can have many more reservation stations than registers
- Parallel release of all dependent instructions as soon as the earlier instruction completes
- Common Data Bus (CDB) is a forwarding mechanism
Limitations
- Branches stall execution of later instructions until the branch is resolved
- This effectively limits the reorder window to the current basic block (4-6 instructions)
- Extending Tomasulo's beyond floating-point operations introduces the risk of imprecise exceptions

Multiple-Issue Processors: Motivation
- Ideal processor: CPI of 1 (no hazards, 1-cycle memory latency)
- Realistic processor: CPI ~1
  - Dynamic scheduling avoids WAR & WAW dependencies
  - Branch prediction avoids control-flow dependencies
  - Caches minimize AMAT
- Question: can we do better than that?

Multiple-Issue Processors
- Answer: yes! Start more than one instruction in the same clock cycle
  - CPI < 1 (or IPC > 1, Instructions Per Cycle)
- Two approaches:
  - Superscalar: instructions are chosen dynamically by the hardware
  - VLIW (Very Long Instruction Word): instructions are chosen statically by the compiler (and assembled into a single long instruction)

Superscalar Processors
- Hardware attempts to issue up to n instructions on every cycle, where n is the issue width; the processor is said to have n issue slots and to be an n-wide processor
- Instructions issued must respect data dependences
  - In some cycles not all issue slots can be used
- Extra hardware is needed to detect more combinations of dependences and hazards, and to provide more bypasses
- Branches?
  - With branch prediction, we can predict branches and fetch instructions past them
  - Can we execute such predicted instructions?

Speculative Execution
- Speculative execution: execute control-dependent instructions even when we are not sure they should be executed
  - Hardware undo in case of a misprediction (software recovery is too costly, performance-wise)
- Key idea: execute out-of-order but commit in order
  - Commit: the results and side effects (e.g., flags, exceptions) of an instruction are made visible to the rest of the system
- Tomasulo + multi-issue + speculation: the foundation of today's high-performance processors

Extending Tomasulo to Support Speculation
- Approach: buffer each result until the instruction is ready to commit (i.e., known to be non-speculative)
  - Use the buffered result for forwarding to dependent instructions
  - Discard the buffered result if the instruction is on a mis-speculated execution path
  - At commit, write the buffered result to the register file or memory
- Decouples execution (potentially speculative) from update of architecturally-visible state (non-speculative)
  - Architecturally-visible state: registers (R0-Rn, F0-Fn), memory, etc.

Enabling Speculation with the Reorder Buffer
- New structure: Reorder Buffer (ROB)
  - Holds completed results until commit time
  - Organized as a queue ordered by program (i.e., fetch) order
  - Takes over the role of the reservation stations for tracking dependencies and bypassing values
  - Accessed by dependent instructions for forwarding of completed, but not-yet-committed, results
  - Reservation stations are still needed to hold issued instructions until they begin execution
- Flushed once mis-speculation is discovered (a mispredicted branch commits)
- Enables precise exceptions: exception state is recorded in the ROB and flushed if the exception occurred on a mis-predicted path
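The ROB's in-order-commit discipline can be sketched with a toy model. This is an illustrative sketch only (the class and field names are invented, not the slides' design): entries are allocated in fetch order, results arrive out of order, and commit retires strictly from the head, squashing everything younger than a mispredicted branch.

```python
from collections import deque

class ReorderBuffer:
    """Toy ROB: entries commit strictly in fetch order; a mispredicted
    branch at the head flushes every younger entry (illustrative sketch)."""
    def __init__(self):
        self.entries = deque()  # queue ordered by program (fetch) order

    def allocate(self, dest):
        """Issue: allocate a ROB entry for an instruction writing `dest`."""
        entry = {"dest": dest, "value": None, "done": False, "mispredicted": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value):
        """Write Result: the CDB broadcast is captured here, not in a register."""
        entry["value"] = value
        entry["done"] = True

    def commit(self, regfile):
        """Commit: retire completed entries from the head; stop at the first
        incomplete one; squash all younger entries behind a misprediction."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            head = self.entries.popleft()
            if head["mispredicted"]:
                self.entries.clear()  # flush the speculative path
                break
            regfile[head["dest"]] = head["value"]  # architectural state updated only here
            committed.append(head["dest"])
        return committed
```

Note how an instruction that finishes early (out of order) still cannot commit until every older instruction has completed, which is exactly what makes exceptions precise.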

Precise Exceptions
- Precise exceptions require that the architecturally-visible state is consistent with sequential (one instruction at a time) execution
  - Architecturally-visible state: registers (R0-Rn, F0-Fn) & memory
  - Implication: all instructions before the excepting instruction are committed, and those after it can be restarted from scratch (i.e., have not modified architectural state)
- Speculation without support for precise exceptions can have nasty consequences
  - E.g., I/O on a mis-speculated path
  - E.g., a program terminated for accessing memory it doesn't have permission to access, when it only does so on a mis-speculated path
- Other benefits of precise exceptions: software debugging, simple exception recovery, easy context switching

Enabling Speculation with the Reorder Buffer
- Instructions are fetched in order from the instruction cache
- Instructions are executed out-of-order (with Tomasulo's)
- Instructions are committed in order (with the ROB)
- Enables precise exceptions and speculative execution
[Figure: pipeline from the Instruction Cache through Out-of-Order Execution, with the Reorder Buffer feeding the Register File]

Tomasulo with Hardware Speculation
- Issue: get instruction from the instruction queue
  - Issue if a reservation station is free and a ROB entry is also free; stall if neither is free
  - Instructions are now tagged with a ROB entry number, not RS.id
- Execute: same as before; monitor the CDB and start the instruction when its operands are available
- Write Result: the CDB broadcasts the result with its ROB identifier
  - The ROB captures the result to commit later
  - Store operations are also held in the ROB until the store data is available and the store instruction is committed
- Commit:
  - If a branch, check the prediction and squash the following instructions if it was incorrect
  - If a store, send data and address to the memory unit and perform the write
  - Else, update the register with the new value and release the ROB entry
[Figure: instruction queue feeding reservation stations (FP adders, FP multipliers), load buffers, and address units, with the Reorder Buffer between the functional units and the FP registers and memory unit; in-flight code: ld f1,4(r1); mul f3,f1,f2; add f4,f5,f3; st f4,8(r2)]

In-order vs Out-of-order Superscalars Compared
- Intel Sandy Bridge: 4-wide decode/issue, 6-wide execute
- Cavium MIPS core: 2-wide, in-order
Source: AnandTech

State-of-the-Art in Out-of-Order Superscalars
Intel Haswell (circa 2014-15):
- Large instruction window: 192 entries
- Deep load & store buffers: 72 load, 42 store
- Deeper OoO scheduling window: 60 entries
- Execution ports: 8 (more execution resources for store address calculation, branches, and integer processing)

VLIW Processors
- Compiler chooses and packs independent instructions into a single long instruction word, or bundle
- Compiler is responsible for avoiding hazards
  - Keeps the hardware simple
  - Compiler's schedule must be conservative to guarantee safety
- Not all portions of the long instruction word will be used in every cycle
- Compiler must be able to expose a lot of parallelism in the schedule to attain good performance
Example word:
  MEM op 1: ld f18,-32(r1) | MEM op 2: ld f22,-40(r1) | FP op 1: addd f4,f0,f2 | FP op 2: addd f8,f6,f2 | INT op: (empty)
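The bundling idea can be sketched as follows. This is a hedged toy model (the slot layout and helper name are invented for illustration, not a real VLIW encoding): the compiler places independent operations into fixed slots, and any slot it cannot fill becomes a NOP, which is where VLIW code size bloat comes from.

```python
# Toy VLIW bundle: fixed slot layout (2 MEM, 2 FP, 1 INT), NOP-padded.
SLOTS = ("MEM1", "MEM2", "FP1", "FP2", "INT")

def make_bundle(**ops):
    """Pack compiler-chosen independent ops into one long instruction word;
    unused slots are filled with NOPs."""
    return tuple(ops.get(slot, "nop") for slot in SLOTS)

# The slide's example word: two loads, two FP adds, an empty INT slot.
word = make_bundle(MEM1="ld f18,-32(r1)", MEM2="ld f22,-40(r1)",
                   FP1="addd f4,f0,f2", FP2="addd f8,f6,f2")
```

A cycle with no schedulable work produces a word that is all NOPs, illustrating why a conservative compiler schedule wastes issue slots.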

VLIW Processors (cont'd)
- Key challenge for VLIW processors: find control-independent work to fill each word
  - Cover data-dependent stalls (e.g., F.DIV immediately followed by a use of its result) with independent instructions
- Solutions:
  - Get rid of control flow: predication, loop unrolling
  - Move code around to maximize scheduling opportunities and minimize stalls

Predication
- Idea: the compiler converts control-flow dependencies into data dependencies
  - In effect, branches are replaced with conditional execution
- How? In practice, with conditional commit
  - Instructions on both paths of the branch are executed
  - Branch instructions are completely eliminated!
  - Each instruction has a predicate bit that is set based on the computation of the predicate
  - Only instructions with TRUE predicates are committed

Predication
Example:
  if (cond) b = 0; else b = 1;
  x = b + 1;
[Slide shows two code versions side by side: the normal, branch-based control flow, which branches around the assignments to a JOIN label, and the predicated equivalent]
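The semantics of predicating this example can be sketched in Python (a sketch of the idea, not actual ISA behavior; the function name is invented): both sides of the branch are evaluated, and the predicate selects which result commits, so no branch remains to mispredict.

```python
def predicated(cond):
    """Predicated form of the slide's example:
        if (cond) b = 0; else b = 1;  x = b + 1;
    Both assignments execute; the predicate decides which one commits."""
    b_taken = 0                            # executed under predicate  p
    b_not_taken = 1                        # executed under predicate !p
    b = b_taken if cond else b_not_taken   # conditional commit, selected by p
    return b + 1                           # x = b + 1 at the join point
```

predicated(True) yields 1 and predicated(False) yields 2, matching the branch-based version, but both assignments cost execution resources on every call, which is the power downside noted on the next slide.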

Predication Pros & Cons
Advantages:
- Avoids pipeline bubbles whenever there is a branch
- Compiler can hide data hazards by scheduling independent instructions
Disadvantages:
- More instructions executed (bad for power)
- Potentially longer critical path than if only the correct side of the branch were executed

Loop Unrolling
- Idea: replicate the loop body multiple times within one iteration of the loop
- Advantages:
  - Reduces loop maintenance overhead (induction variable update, condition code computation, branch)
  - Enlarges basic block size, thus maximizing scheduling opportunities
    - Basic block: a sequence of instructions with exactly one entry point and exactly one exit point
- Drawback:
  - Need extra code to detect and deal with cases where the iteration count is not a multiple of the unroll factor
  - In practice, this is a small price to pay
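A 4x unroll, including the cleanup code the drawback bullet refers to, can be sketched as (illustrative function, not from the slides):

```python
def sum_unrolled(a):
    """Sum with the loop body unrolled 4x: one induction-variable update and
    one branch per four elements, plus a cleanup loop for the remainder."""
    total, i, n = 0, 0, len(a)
    while i + 4 <= n:        # unrolled main loop: 4 copies of the body
        total += a[i]
        total += a[i + 1]
        total += a[i + 2]
        total += a[i + 3]
        i += 4               # single induction-variable update per 4 adds
    while i < n:             # cleanup: handles len(a) % 4 leftover elements
        total += a[i]
        i += 1
    return total
```

The four adds in the main loop form one larger basic block, giving a VLIW compiler (or a superscalar scheduler) independent work to pack into each cycle.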

Safety and legality of code motion
Two characteristics of code motion:
- Safety: whether or not spurious exceptions may occur
- Legality: whether or not the result is guaranteed correct


Multiflow and Trace Scheduling
- Multiflow Computer pioneered the VLIW design style
  - Available in VLIW widths of 7, 14, and 28 ops/instruction
  - The maximum width of 28 required a 1024-bit word
- Introduced powerful compilation techniques, particularly trace scheduling
  - Trace scheduling fuses multiple basic blocks on hot code paths into regions called traces
  - These traces create rich opportunities for code motion
[Figure: the original program with its hot path, the extracted trace, and the trace with compensating code]

Modern-day case study: TriMedia TM-1000
- 5 issue slots, 28 functional units, 128 registers
- Each instruction within a word can be predicated
  - Any one of the 128 registers can serve as the predicate (its LSB is used)
- No dynamic hazard detection (the compiler's job)
  - Except on cache misses, which lock the pipeline
- Non-excepting loads enable load speculation (no virtual memory)

Superscalar vs. VLIW Processors
Superscalars
+ Able to handle dynamic events like cache misses, unpredictable memory dependences, branches, etc.
+ Can exploit old binaries from previous implementations
- Complexity limits issue width to 4-8
VLIW
+ Much simpler hardware implementation
+ Implementations can have wider issue than superscalars
- Requires more complex compiler support
- Cannot use old binaries when the pipeline implementation changes
- Code size increases because of empty issue slots

What are the practical Limitations to ILP?
- Limits on maximum issue width and instruction window size
- Effects of realistic branch prediction
- The effect of a limited number of rename registers
- Memory aliasing
- Variable memory latencies (because of caches)
Available ILP in a perfect processor, with none of the above constraints (6 SPEC92 benchmarks; the first 3 are integer, the last 3 are FP):
  gcc: 55, espresso: 63, li: 18, fpppp: 75, doduc: 119, tomcatv: 150 instruction issues per cycle
These levels of ILP are impossible to achieve in practice due to the limitations above. (H&P 4th ed, fig 3.1, p.157)

Effect of Instruction Window (i.e., ROB) Size
[Figure: instructions per clock for gcc, espresso, li, fpppp, doduc, and tomcatv with instruction window sizes of infinite, 2048, 512, 128, and 32 entries; IPC falls sharply as the window shrinks. H&P 4th ed, fig 3.2, p.159]

Effect of Branch Prediction (2K-entry ROB)
[Figure: instruction issues per cycle for gcc, espresso, li, fpppp, doduc, and tomcatv under five branch-prediction schemes: perfect, selective predictor, standard 2-bit, static, and none; issue rates fall steeply as prediction quality degrades. H&P 4th ed, fig 3.3, p.160]

Limits to Multiple-issue
- Fundamental limits to ILP in most programs:
  - Need N independent instructions to keep a W-issue processor busy, where N = W * pipeline depth
  - Data and control dependences significantly limit the amount of ILP
- Complexity of the hardware grows with issue width:
  - Number of functional units increases linearly (OK)
  - Number of register file ports increases linearly (bad)
  - Number of memory ports increases linearly (bad)
  - Number of dependence tests increases quadratically (bad)
  - Bypass/forwarding logic and wires increase quadratically (bad)
- The two quadratic terms tend to ultimately limit the width of practical dynamically-scheduled superscalars
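The rule of thumb and the quadratic dependence-check cost above can be made concrete with a small sketch (illustrative helper names, not from the slides):

```python
def independent_insts_needed(width, depth):
    """N = W * pipeline depth: independent instructions needed to keep a
    W-issue, depth-stage pipeline full (the slide's rule of thumb)."""
    return width * depth

def dependence_tests_per_cycle(width):
    """Each instruction issued in a cycle must be cross-checked against the
    others issued that cycle: W*(W-1)/2 pairwise tests, quadratic in W."""
    return width * (width - 1) // 2
```

For example, a 4-wide, 8-stage machine needs 32 independent instructions in flight, and doubling the width from 4 to 8 more than quadruples the per-cycle dependence tests (6 to 28), which is why wide dynamically-scheduled designs become impractical.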

Summary of Factors Limiting ILP in Real Programs
Compared with an ideal processor:
- Limited instruction window
- Finite number of registers (introduces WAW and WAR stalls)
- Imperfect branch prediction (pipeline flushes)
- Limited issue width
- Instruction fetch delays (cache misses, across-block fetch)
- Imperfect memory disambiguation (conservative RAW stalls)
Implications for future performance growth?
- A single processor has inherent limits
- To use future silicon area, we need to go to multiple processors

Acknowledgement
These slides contain material developed by Onur Mutlu (CMU) for his ECE 447 course.