EE382A Lecture 6: Register Renaming
Department of Electrical Engineering, Stanford University
http://eeclass.stanford.edu/ee382a


Announcements
- Project proposal due on Wed 10/14: 2-3 pages submitted through email
  - List the group members
  - Describe the topic, including why it is important, and your thesis
  - Describe the methodology you will use (experiments, tools, machines)
  - Statement of expected results
  - A few key references to related work
- Still missing some photos

Lecture 6 Outline
1. Branch Prediction (epilog)
   a. 2-level predictors
   b. AMD Opteron example
   c. Confidence prediction
   d. Trace cache
2. Register Data Flow
   a. False register dependences
   b. Register renaming technique
   c. Register renaming implementation

Dynamic Branch Prediction Using History
- [Figure] The branch predictor (using a BTB) sits next to the fetch stage. The FA-mux selects the next fetch address sent to the I-cache: either the sequential path, npc(seq.) = PC + 4, or the predicted path, npc = BP(PC), which carries a speculative condition prediction and a speculative target.
- The BTB is updated (target address and history) once the branch issues and executes.
- Pipeline: Fetch -> Decode Buffer -> Decode -> Dispatch Buffer -> Dispatch -> Reservation Stations -> Issue -> Branch Execute -> Finish -> Completion Buffer.
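The fetch-redirect loop above (npc = BP(PC) on a predicted-taken BTB hit, PC + 4 otherwise, with the BTB updated when the branch executes) can be sketched as follows. This is a minimal illustration in C; the table size, tag handling, and 2-bit counter policy are assumptions for the sketch, not details taken from the lecture figure.

```c
#include <stdint.h>
#include <stdbool.h>

#define BTB_ENTRIES 1024

typedef struct {
    bool     valid;
    uint32_t tag;      /* PC identifying the branch                */
    uint32_t target;   /* predicted (speculative) target address   */
    uint8_t  ctr;      /* 2-bit saturating counter: taken if >= 2  */
} btb_entry_t;

static btb_entry_t btb[BTB_ENTRIES];

/* Predict the next fetch PC for the instruction at pc. */
uint32_t predict_next_pc(uint32_t pc)
{
    btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES];

    if (e->valid && e->tag == pc && e->ctr >= 2)
        return e->target;          /* speculative target: npc = BP(pc) */
    return pc + 4;                 /* sequential path: npc = pc + 4    */
}

/* Update the BTB (target address and history) after the branch executes. */
void update_btb(uint32_t pc, bool taken, uint32_t target)
{
    btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES];

    if (!e->valid || e->tag != pc) {   /* allocate on a newly seen branch */
        e->valid = true;
        e->tag = pc;
        e->ctr = taken ? 2 : 1;
    } else if (taken && e->ctr < 3) {
        e->ctr++;
    } else if (!taken && e->ctr > 0) {
        e->ctr--;
    }
    if (taken)
        e->target = target;
}
```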

2-Level Adaptive Prediction [Yeh & Patt]
- Nomenclature: {G,P}A{g,p,s}: the first letter says how branch history is kept (G = one global BHSR, P = per-branch BHSRs); the last letter says how the pattern history tables are organized (g = one global PHT, p = per-branch PHTs, s = per-set PHTs).
- [Figure] A Branch History Shift Register (BHSR, shifted left on update) indexes a Pattern History Table (PHT) of 2-bit entries (indices 00...00 through 11...11); prediction FSM logic turns the indexed PHT bits into a prediction, and the branch result updates both the BHSR and the old PHT bits.
- To achieve 97% average prediction accuracy:
  - GAg: 1 BHR of 18 bits; 1 PHT of 2^18 x 2 bits; total ~524 kbits
  - PAg: 512x4 BHRs of 12 bits; 1 PHT of 2^12 x 2 bits; total ~33 kbits
  - PAs: 512x4 BHRs of 6 bits; 512 PHTs of 2^6 x 2 bits; total ~78 kbits

Example: Global BHSR Scheme (GAs)
- [Figure] j bits of the branch address are concatenated with the k-bit global BHSR to index a BHT of 2^(j+k) two-bit entries that supplies the prediction.

Example: Per-Branch BHSR Scheme (PAs)
- [Figure] i bits of the branch address select one of 2^i k-bit BHSRs; j bits of the branch address plus the selected k history bits index a BHT of 2^(j+k) two-bit entries.

Gshare Branch Prediction [McFarling]
- Standard BHT: indexed by j bits of the branch address alone.
- [Figure] Gshare: j bits of the branch address are XORed with the k-bit BHSR to index a BHT of 2^max(j,k) two-bit entries.
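A minimal sketch of the gshare scheme described above: address bits are XORed with the global history to index a table of 2-bit counters. The history length and table size are illustrative choices, and the update uses the resolved outcome for simplicity (a real front end would also manage speculative history).

```c
#include <stdint.h>
#include <stdbool.h>

#define HIST_BITS 12
#define BHT_SIZE  (1u << HIST_BITS)      /* 2^max(j,k) entries */

static uint8_t  bht[BHT_SIZE];           /* 2-bit saturating counters */
static uint32_t ghr;                     /* global branch history     */

static uint32_t gshare_index(uint32_t pc)
{
    return ((pc >> 2) ^ ghr) & (BHT_SIZE - 1);   /* PC bits XOR history */
}

bool gshare_predict(uint32_t pc)
{
    return bht[gshare_index(pc)] >= 2;   /* weakly/strongly taken */
}

void gshare_update(uint32_t pc, bool taken)
{
    uint32_t i = gshare_index(pc);

    if (taken  && bht[i] < 3) bht[i]++;
    if (!taken && bht[i] > 0) bht[i]--;

    /* shift the outcome into the global history (shift left on update) */
    ghr = ((ghr << 1) | (taken ? 1u : 0u)) & (BHT_SIZE - 1);
}
```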

Fetch & Predict Example: AMD Opteron
- [Figure-only slides: the Opteron front-end fetch and predict hardware.]

Why is Prediction Important in Opteron?
- [Figure-only slide.]

Other Branch Prediction Related Issues
- Multi-cycle BTB
  - Keep fetching sequentially and repair later (bubbles for taken branches)
  - Needs pipelined access, though
- BTB & predictor in series
  - Get a fast target/direction prediction from the BTB only
  - After decoding, use the predictor to verify the BTB
  - Causes a pipeline mini-flush if the BTB was wrong
  - This approach allows for a much larger/slower predictor
- BTB and predictor integration
  - Can merge the BTB with the local part of a predictor
  - Can merge both with the I-cache entries
- Predictor/BTB/RAS updates
  - Can you see any issue?

Prediction Confidence: A Very Useful Tool for Speculation
- Branch confidence prediction: estimate whether your prediction is likely to be correct
- Applications
  - Avoid fetching down an unlikely path; save time & power by waiting
  - Start executing down both paths (selective eager execution)
  - Switch to another thread (for multithreaded processors)
- Implementation
  - Naïve: don't use the NT or TN states of the 2-bit counters
  - Better: an array of CIRs (correct/incorrect registers); shift in whether the last prediction was correct or incorrect, and count the number of 0s to determine confidence (a sketch appears below)
  - Many other implementations are possible (using counters, etc.)

Dynamic History Length
- [Figure-only slide.]

Understanding Advanced Predictors
- Four types of history
  - Local (bimodal) history (Smith predictor): a table of counters summarizes local history; simple, but only effective for biased branches
  - Local outcome history: shift register of individual branch outcomes; separate counter for each outcome history
  - Global outcome history: shift register of recent branch outcomes; separate counter for each outcome history
  - Path history: shift register of recent (partial) block addresses; can differentiate similar global outcome histories
- Can combine or alloy histories in many ways
- History length
  - Short history: lower training cost
  - Long history: captures macro-level behavior
  - Variable history length predictors
- Really long history (long loops)
  - Loop count predictors
  - Fourier transform into the frequency domain
- Limited capacity & interference
  - Constructive vs. destructive
  - Bi-mode, gskewed, agree, YAGS
- Read sec. 9.3.2 carefully
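The CIR-based confidence estimator described above can be sketched as follows: each entry shifts in a 1 for a correct prediction and a 0 for an incorrect one, and confidence is judged by counting the 0s. The table size, window length, and threshold are assumptions for the sketch.

```c
#include <stdint.h>
#include <stdbool.h>

#define CIR_ENTRIES     1024
#define CIR_BITS        8    /* window of the last 8 predictions                   */
#define LOW_CONF_ZEROS  2    /* this many recent mispredictions => low confidence  */

static uint8_t cir[CIR_ENTRIES];

static uint32_t cir_index(uint32_t pc)
{
    return (pc >> 2) % CIR_ENTRIES;
}

/* Record whether the latest prediction for this branch was correct. */
void confidence_update(uint32_t pc, bool prediction_was_correct)
{
    uint32_t i = cir_index(pc);
    cir[i] = (uint8_t)((cir[i] << 1) | (prediction_was_correct ? 1 : 0));
}

/* High confidence if few 0s (mispredictions) appear in the recent window. */
bool high_confidence(uint32_t pc)
{
    uint8_t bits = cir[cir_index(pc)];
    int zeros = 0;
    for (int b = 0; b < CIR_BITS; b++)
        if (((bits >> b) & 1) == 0)
            zeros++;
    return zeros < LOW_CONF_ZEROS;
}
```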

High-Bandwidth Fetch: Trace Cache
- [Figure] A conventional instruction cache stores blocks A B C D / E F G H I J in program-address order; a trace cache stores the dynamic sequence A B C D E F G H I J as a single trace.
- (a) Fold out taken branches by tracing instructions as they commit into a fill buffer.
- (b) Fetched traces are sent to the renamer, execute stage, etc.
- Eric Rotenberg, S. Bennett, and James E. Smith. "Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching." MICRO, December 1996.

Intel Pentium 4 Trace Cache
- [Figure] The front-end BTB and the instruction TLB/prefetcher access the level-two unified data and instruction cache; the instruction decoder fills the trace cache, which, together with its own trace cache BTB, feeds the instruction fetch queue.
- No first-level instruction cache: trace cache only
- The trace cache BTB identifies the next trace
- A miss leads to a fetch from the level-two cache
- Trace cache instructions are already decoded (uops)

Modern Superscalar, Out-of-order Processor
- [Figure] FETCH (branch predictor, I-cache, instruction buffer) -> DECODE -> EXECUTE (integer, floating-point, media, and memory units with a reorder buffer (ROB) and store queue) -> COMMIT (D-cache), with three flows through the machine: instruction flow, register data flow, and memory data flow.

What Limits ILP?
- Pipelining reduces cycle time; superscalar execution increases IPC (instructions per cycle)
- Both schemes need to find lots of ILP in the program
- Must simultaneously increase the number of instructions considered, increase the number of instructions executed, and allow for out-of-order execution

Instruction Processing Constraints
- Resource contention (structural dependences)
- Code dependences
  - Control dependences
  - Data dependences
    - True dependences (RAW)
    - Storage conflicts
      - Anti-dependences (WAR)
      - Output dependences (WAW)
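The dependence taxonomy above can be made concrete with a small classification helper: given an earlier instruction A and a later instruction B in program order, it reports RAW (true), WAR (anti), and WAW (output) register dependences. The bitmask encoding is an assumption for illustration.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t dst_mask;   /* registers written */
    uint32_t src_mask;   /* registers read    */
} insn_t;

void classify(insn_t a, insn_t b)   /* a precedes b in program order */
{
    if (a.dst_mask & b.src_mask)
        printf("RAW (true dependence)\n");    /* B reads what A writes      */
    if (a.src_mask & b.dst_mask)
        printf("WAR (anti-dependence)\n");    /* B overwrites A's source    */
    if (a.dst_mask & b.dst_mask)
        printf("WAW (output dependence)\n");  /* both write the same reg    */
}

int main(void)
{
    /* e.g. A: add r1 <- r2 + r3 ; B: sub r2 <- r1 - r4 */
    insn_t a = { .dst_mask = 1u << 1, .src_mask = (1u << 2) | (1u << 3) };
    insn_t b = { .dst_mask = 1u << 2, .src_mask = (1u << 1) | (1u << 4) };
    classify(a, b);   /* prints RAW and WAR */
    return 0;
}
```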

Register Renaming & Dynamic Scheduling
- Register renaming: addresses limitations of the scoreboard
  - Scoreboard limitation: up to one pending instruction per destination register
  - Eliminates WAR and WAW dependences without stalling
- Dynamic scheduling
  - Track & resolve true-data dependences (RAW)
  - Scheduling hardware: instruction window, reservation stations, common data bus, ...
  - Original proposal: Tomasulo's algorithm [Tomasulo, 1967]

Register Data Flow: Instruction Execution Model
- [Figure] Registers R0..Rm feed functional units FU1..FUn through an interconnect; each ALU instruction is a register transfer Ri <- Fn(Rj, Rk) with a destination register, function unit, and source registers, carried out as read, execute, write.
- Needs availability of Fn (structural dependences)
- Needs availability of Rj, Rk (true data dependences)
- Needs availability of Ri (anti- and output dependences)

Causes of (Register) Storage Conflict
- Register recycling
  - Maximize use of registers
  - Multiple assignments of values to registers
- Out-of-order issuing and completion
  - Lose the implied precedence of sequential code
  - Lose the 1-to-1 correspondence between values and registers
- [Figure] DEF Ri ... DEF Ri creates a WAW hazard; USE Ri ... DEF Ri creates a WAR hazard.

The Reason for WAW and WAR: Register Recycling
- Compiler register allocation: code generation, then register allocation
  - Single assignment with symbolic registers
  - Map symbolic registers to architected registers
  - Maximize reuse of registers
- Instruction loops: for (k = 1; k <= 10; k++) t += a[i][k] * b[k][j];

      9  $34: mul  $14, $7, 40
      10      addu $15, $4, $14
      11      mul  $24, $9, 4
      12      addu $25, $15, $24
      13      lw   $11, 0($25)
      14      mul  $12, $9, 40
      15      addu $13, $5, $12
      16      mul  $14, $8, 4
      17      addu $15, $13, $14
      18      lw   $24, 0($15)
      19      mul  $25, $11, $24
      20      addu $10, $10, $25
      21      addu $9, $9, 1
      22      ble  $9, 10, $34

- The same set of registers is reused in each iteration; for example, instruction 16 redefines $14, creating a WAW hazard with instruction 9 and a WAR hazard with instruction 10. Overlapped execution of different iterations creates further WAR and WAW hazards across iterations.

Resolving False Dependences
- WAR example: (1) R4 <- R3 + 1; (2) R3 <- R5 + 1. Must prevent (2) from completing before (1) is dispatched.
- WAW example: (1) R3 <- R3 + R5; (2) R3 <- R5 + 1. Must prevent (2) from completing before (1) completes.
- Solutions:
  - Stalling: delay dispatching (or write-back) of the later instruction
  - Copy operands: copy the not-yet-used operand to prevent it from being overwritten (WAR)
  - Register renaming: use a different register (WAW & WAR)

Register Renaming: The Idea
- Anti- and output dependences are false dependences:
      r3 <- r1 op r2
      r5 <- r3 op r4
      r3 <- r6 op r7
- The dependence is on a name/location rather than on data
- Given an unlimited number of registers, anti- and output dependences can always be eliminated:

      Original           Renamed
      r1 <- r2 / r3      r1 <- r2 / r3
      r4 <- r1 * r5      r4 <- r1 * r5
      r1 <- r3 + r6      r8 <- r3 + r6
      r3 <- r1 - r4      r9 <- r8 - r4

Register Renaming Technique
- Register renaming resolves anti-dependences and output dependences
- [Figure] Architected registers R1..Rn are mapped onto physical registers P1..Pn+k
- Design of the redundant registers:
  - Number: one vs. multiple
  - Allocation: fixed for each register vs. pooled for all registers
  - Location: attached to the register file (centralized) vs. attached to the functional units (distributed)

Register Renaming Implementation
- Renaming: map a small set of architectural registers to a large set of physical registers
  - A new mapping is created for an architectural register each time it is assigned a new value
- Renaming buffer organization (how and where renamed registers are stored)
  - Unified RF, split RF, or renaming in the ROB (RF = register file)
  - Number of renaming registers
  - Number of read/write ports
- Register mapping (how do I find the register I am looking for)
  - Allocation, de-allocation, and tracking
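A minimal renaming sketch for the example above, assuming a simple architectural-to-physical map table and an unbounded pool of fresh physical registers (the "unlimited registers" idealization). Unlike the slide's renamed code, which keeps r1 and r4 for the first two writes, this sketch allocates a fresh physical name for every destination, as a merged-register-file renamer does.

```c
#include <stdio.h>

#define ARCH_REGS 8

typedef struct { int dst, src1, src2; } insn_t;   /* dst <- src1 op src2 */

static int map[ARCH_REGS];   /* architectural -> physical mapping  */
static int next_phys;        /* next free physical register number */

static void rename_insn(insn_t *in)
{
    /* sources first: read the current mappings */
    in->src1 = map[in->src1];
    in->src2 = map[in->src2];
    /* destination: allocate a fresh physical register and update the map */
    map[in->dst] = next_phys++;
    in->dst = map[in->dst];
}

int main(void)
{
    for (int r = 0; r < ARCH_REGS; r++)   /* start with identity mapping */
        map[r] = r;
    next_phys = ARCH_REGS;

    /* the example sequence: r1<-r2/r3, r4<-r1*r5, r1<-r3+r6, r3<-r1-r4 */
    insn_t code[4] = { {1,2,3}, {4,1,5}, {1,3,6}, {3,1,4} };

    for (int i = 0; i < 4; i++) {
        rename_insn(&code[i]);
        printf("p%d <- p%d op p%d\n", code[i].dst, code[i].src1, code[i].src2);
    }
    return 0;
}
```

Running the sketch shows that the two writes to r1 and the write to r3 now target distinct physical registers, so only the true (RAW) dependences remain.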

Renaming Buffer Options
- Unified/merged register file (MIPS R10K, Alpha 21264)
  - Registers change role from architectural to renamed
- Rename register file (RRF) (PA 8500, PPC 620)
  - Holds new values until they are committed to the ARF (extra data transfer)
- Renaming in the ROB (Pentium III)
- Note: can use a single scheme or separate schemes for integer/FP

Unified Register File: Physical Register FSM
- [Figure-only slide: the state machine a physical register follows in a unified register file.]

Number of Rename Registers
- Naïve: as many as the number of pending instructions (waiting to be scheduled + executing + waiting to commit)
- Simplification: no renaming needed for stores, branches, ...
- Usual approach: the # of scheduler entries, # of RRF entries, and # of ROB entries are sized together
- Examples:
  - PPC 620: scheduler 15, RRF 16 (separate RRF), ROB 16
  - MIPS R12000: scheduler 48, RRF 64 (merged), ROB 48
  - Pentium III: scheduler 20, RRF 40 (in ROB), ROB 40

Register File Ports
- Read ports, if operands are read as instructions enter the scheduler: max # ports = 2 x # instructions dispatched
- Read ports, if operands are read as instructions leave the scheduler: max # ports = 2 x # instructions issued (can be wider than the # of instructions dispatched)
- Write ports: # of FUs or # of instructions committing, depending on unified vs. separate rename registers
- Notes:
  - Can implement fewer ports and accept structural hazards; needs control logic for port assignment & hazard handling
  - When using a separate RRF and ARF, ports are also needed for the final transfer
  - Alternatives to increasing ports: duplicated RF or banked RF. What are the issues?
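To make the port arithmetic concrete, here is a tiny worked example for a hypothetical machine (4-wide dispatch, 6-wide issue, 4-wide commit, 6 functional units); the widths are illustrative assumptions, not any specific processor.

```c
#include <stdio.h>

int main(void)
{
    int dispatch_width = 4, issue_width = 6, commit_width = 4, num_fus = 6;

    /* operands read as instructions enter the scheduler */
    printf("read ports (read at dispatch): %d\n", 2 * dispatch_width);   /* 8  */

    /* operands read as instructions leave the scheduler (can exceed dispatch width) */
    printf("read ports (read at issue):    %d\n", 2 * issue_width);      /* 12 */

    /* write ports: one per FU result or one per committing instruction,
     * depending on unified vs. separate rename registers */
    printf("write ports (per FU):          %d\n", num_fus);              /* 6  */
    printf("write ports (per commit):      %d\n", commit_width);         /* 4  */
    return 0;
}
```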

Register Mapping (From Architectural to Physical Address)
- Option 1: use a map table (architectural register # -> physical location)
  - The map holds the state of the register too
  - Simple, but reading takes two steps (OK if operands are read late)
- Option 2: associative search in the RRF, ROB, ...
  - Each physical register remembers its status (OK if operands are read early)
  - More complicated, but a one-step read

Integrating Map Tables with the ARF
- [Figure-only slide: map table integrated with the architectural register file.]

Renaming Operation: Allocation, Lookup, De-allocation
- At dispatch, for each instruction handled in parallel:
  - Check the physical location & availability of the source operands
  - Map the destination register to a new physical register; stall if no register is available
  - Note: must have enough ports to any map tables
- At complete: update the physical location (the value is now available)
- At commit/retire, for each instruction handled in parallel:
  - Copy from the RRF/ROB to the ARF & deallocate the RRF entry, OR
  - Update the physical location and deallocate the register holding the old value; it is now safe to do that
- Question: can we allocate later or deallocate earlier?
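A sketch of the allocate / lookup / de-allocate flow above for a unified (merged) register file with a free list, handling one instruction at a time and ignoring exceptions; the structure names and sizes are illustrative assumptions.

```c
#include <stdbool.h>

#define ARCH_REGS 32
#define PHYS_REGS 64

static int  map[ARCH_REGS];      /* speculative architectural -> physical map */
static bool ready[PHYS_REGS];    /* has the physical register been written?   */
static int  free_list[PHYS_REGS];
static int  free_count;

void rename_init(void)
{
    for (int r = 0; r < ARCH_REGS; r++) { map[r] = r; ready[r] = true; }
    free_count = 0;
    for (int p = ARCH_REGS; p < PHYS_REGS; p++)
        free_list[free_count++] = p;
}

static int alloc_phys(void)
{
    return free_count > 0 ? free_list[--free_count] : -1;   /* -1 => stall */
}

/* At dispatch: look up the sources, then map the destination to a fresh
 * physical register. Returns the physical register that held the OLD
 * mapping of dst (the instruction carries it to commit), or -1 to stall. */
int dispatch(int dst, int src1, int src2, int *psrc1, int *psrc2, int *pdst)
{
    *psrc1 = map[src1];          /* physical location of the sources;        */
    *psrc2 = map[src2];          /* ready[*psrc1]/ready[*psrc2] = availability */

    int newp = alloc_phys();
    if (newp < 0)
        return -1;               /* no free physical register: stall dispatch */

    int oldp = map[dst];
    map[dst] = newp;
    ready[newp] = false;         /* value not produced yet */
    *pdst = newp;
    return oldp;
}

/* At complete: the value is now available in the physical register. */
void complete(int pdst)
{
    ready[pdst] = true;
}

/* At commit/retire: it is now safe to free the register holding the old value. */
void commit(int old_pdst)
{
    free_list[free_count++] = old_pdst;
}
```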

Renaming Difficulties: Wide Instruction Issue
- Need many ports in the RFs and the mapping tables
- Instruction dependencies during dispatching/issuing/committing
  - Must handle dependencies across instructions handled in the same cycle, e.g. add R1 <- R2 + R3; sub R6 <- R1 + R5
  - Implementation: use comparators, multiplexors, counters
    - Comparators: discover RAW dependencies
    - Multiplexors: generate the right physical address (old or new allocation)
    - Counters: determine the number of physical registers allocated

Renaming Difficulties: Mispredictions & Exceptions
- If an exception or misprediction occurs, the register mapping must be made precise
  - Separate RRF: consider all RRF entries free
  - ROB renaming: consider all ROB entries free
  - Unified RF: restore the precise mapping
    - Single map: traverse the ROB to undo the mapping (history file approach); the ROB must remember the old mapping
    - Two maps: an architectural and a future register map; on an exception, copy the architectural map into the future map
    - Checkpointing: keep regular checkpoints of the map and restore one when needed (a sketch appears at the end of this section)
      - When do we make a checkpoint? On every instruction? On every branch? What are the trade-offs?
      - We'll revisit this approach later on

Dynamic Scheduling Based on Reservation Stations
- [Figure] In-order front end: dispatch buffer -> dispatch (reading the register file and rename registers, allocating reorder buffer entries) -> reservation stations. Out-of-order execution: branch, integer, integer, floating-point, and load/store units fed by the reservation stations, with register write-back. In-order back end: completion buffer (reorder buffer) -> complete.

Embedded Data Flow Engine
- Dispatch buffer -> Dispatch
  - Read the register, or assign a register tag
  - Advance instructions to the reservation stations
- Reservation stations (dynamic execution)
  - Monitor the register tags
  - Receive data being forwarded
  - Issue when all operands are ready
- Completion buffer (reorder buffer) -> Complete
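Returning to the misprediction-recovery options above, here is a minimal sketch of the checkpointing approach: copy the future (speculative) map table when a branch is dispatched, and restore it if that branch mispredicts. The number of checkpoints and the interface are assumptions for the sketch.

```c
#include <stdbool.h>

#define ARCH_REGS       32
#define MAX_CHECKPOINTS 8

typedef struct { int map[ARCH_REGS]; } map_table_t;

static map_table_t future_map;                    /* speculative map */
static map_table_t checkpoints[MAX_CHECKPOINTS];  /* saved maps      */
static bool        in_use[MAX_CHECKPOINTS];

/* Take a checkpoint when a (predicted) branch is dispatched.
 * Returns a checkpoint id, or -1 if none is free (stall dispatch). */
int checkpoint_take(void)
{
    for (int c = 0; c < MAX_CHECKPOINTS; c++) {
        if (!in_use[c]) {
            checkpoints[c] = future_map;          /* copy the whole map */
            in_use[c] = true;
            return c;
        }
    }
    return -1;
}

/* Branch resolved correctly: just release the checkpoint. */
void checkpoint_release(int c)
{
    in_use[c] = false;
}

/* Misprediction: restore the map saved at the mispredicted branch. */
void checkpoint_restore(int c)
{
    future_map = checkpoints[c];
    in_use[c] = false;
    /* younger checkpoints (taken after this branch) must also be squashed
     * by the recovery logic; omitted in this sketch */
}
```

Checkpointing on every instruction makes recovery instant but costs the most storage and copy bandwidth; checkpointing only on branches keeps the cost proportional to the number of in-flight branches, which is the usual trade-off the questions above are pointing at.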