EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University
2 Announcements
Project proposal due on Wed 10/ pages submitted through
- List the group members
- Describe the topic, including why it is important and your thesis
- Describe the methodology you will use (experiments, tools, machines)
- Statement of expected results
- A few key references to related work
3 What Limits ILP
Instruction processing constraints:
- Resource contention (structural dependences)
- Code dependences
  - Control dependences
  - Data dependences
    - True dependences (RAW)
    - Storage conflicts
      - Anti-dependences (WAR)
      - Output dependences (WAW)
4 The Reason for WAW and WAR: Register Recycling
Compiler:
- Code generation: single assignment, symbolic registers
- Register allocation: map symbolic registers to physical registers; maximize reuse of registers
Instruction loops:
- Reuse the same set of registers in each iteration
- Overlapped execution of different iterations
Source loop:
    for (k = 1; k <= 10; k++) t += a[i][k] * b[k][j];
Generated code:
    $34: mul  $14, $7,
         addu $15, $4, $14
         mul  $24, $9, 4
         addu $25, $15, $24
         lw   $11, 0($25)
         mul  $12, $9,
         addu $13, $5, $12
         mul  $14, $8, 4
         addu $15, $13, $14
         lw   $24, 0($15)
         mul  $25, $11, $24
         addu $10, $10, $25
         addu $9, $9, 1
         ble  $9, 10, $34
5 Resolving False Dependences
WAR (anti-dependence):
    (1) R4 <- R3 + 1
    (2) R3 <- R5 + 1
Must prevent (2) from completing before (1) is dispatched.
WAW (output dependence):
    (1) R3 <- R3 + R5
    (2) R3 <- R5 + 1
Must prevent (2) from completing before (1) completes.
Options:
- Stalling: delay dispatching (or write-back) of the later instruction
- Copy operands: copy a not-yet-used operand to prevent it from being overwritten (WAR)
- Register renaming: use a different register (WAW & WAR)
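The two instruction pairs above can be checked mechanically. The following is a minimal sketch, not part of the lecture; the `Instr` tuple and the `hazards` function are hypothetical names introduced only for illustration. It compares destination and source register names of an older and a younger instruction.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: one destination, some sources."""
    dest: str
    srcs: Tuple[str, ...]


def hazards(older: Instr, younger: Instr) -> set:
    """Return the dependence types from `older` to `younger`."""
    found = set()
    if older.dest in younger.srcs:
        found.add("RAW")   # younger reads what older writes (true dependence)
    if younger.dest in older.srcs:
        found.add("WAR")   # younger overwrites a source of older (anti)
    if younger.dest == older.dest:
        found.add("WAW")   # both write the same register (output)
    return found


# The two pairs from the slide.
war_pair = (Instr("R4", ("R3",)), Instr("R3", ("R5",)))
waw_pair = (Instr("R3", ("R3", "R5")), Instr("R3", ("R5",)))
```

Note that in the second pair the older instruction also reads R3, so the check reports a WAR hazard in addition to WAW.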
6 Register Renaming: The Idea
Anti and output dependences are false dependences:
    r3 <- r1 op r2
    r5 <- r3 op r4
    r3 <- r6 op r7
The dependence is on name/location rather than data.
Given an unlimited number of registers, anti and output dependences can always be eliminated.
    Original            Renamed
    r1 <- r2 / r3       r1 <- r2 / r3
    r4 <- r1 * r5       r4 <- r1 * r5
    r1 <- r3 + r6       r8 <- r3 + r6
    r3 <- r1 - r4       r9 <- r8 - r4
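The original/renamed example on this slide can be reproduced with a short sketch. This is an illustration under stated assumptions, not the lecture's algorithm: the `rename` function is hypothetical, it gives a destination a fresh name only if that name was already read or written earlier in the sequence, and it assumes the fresh names (r8, r9, ...) do not appear in the input.

```python
def rename(program, next_free=8):
    """program: list of (dest, src1, op, src2) tuples in program order."""
    mapping = {}   # architectural name -> name holding its latest value
    seen = set()   # architectural names already read or written
    renamed = []
    for dest, src1, op, src2 in program:
        # Sources always read the most recent name for their register.
        s1 = mapping.get(src1, src1)
        s2 = mapping.get(src2, src2)
        seen.update((src1, src2))
        if dest in seen:
            # Reusing this name would create a WAR or WAW dependence.
            new_dest = f"r{next_free}"
            next_free += 1
        else:
            new_dest = dest
        mapping[dest] = new_dest
        seen.add(dest)
        renamed.append((new_dest, s1, op, s2))
    return renamed


original = [
    ("r1", "r2", "/", "r3"),
    ("r4", "r1", "*", "r5"),
    ("r1", "r3", "+", "r6"),
    ("r3", "r1", "-", "r4"),
]
```

Calling `rename(original)` leaves the first two instructions unchanged and rewrites the last two to use r8 and r9, matching the renamed column on the slide. The true (RAW) dependences are preserved because sources are translated through `mapping`.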
7 Register Renaming Technique
Register renaming resolves:
- Anti-dependences
- Output dependences
Design of redundant registers:
- Number: one or multiple
- Allocation: fixed for each register, or pooled for all registers
- Location: attached to the register file (centralized), or attached to functional units (distributed)
[Figure: architected registers R1, R2, ..., Rn mapped onto physical registers P1, P2, ..., Pn, ..., Pn+k]
8 Integrating Map Tables with the ARF
9 Register Renaming Operations
At decode/dispatch, for each instruction handled in parallel:
1. Source read: check availability of source operands
2. Destination allocate: map the destination register to a new physical register
   - Stall if no register is available
Note: must have enough ports to any map tables
At finish:
3. Register update: update physical register
At complete/commit, for each instruction handled in parallel:
3. Register update: update architectural register
   - Copy from RRF/ROB to ARF and deallocate the RRF entry; OR
   - Upgrade the physical location and deallocate the register holding the old value (it is now safe to do that)
Question: can we allocate later or deallocate earlier?
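The dispatch, finish, and commit steps above can be sketched for the unified-register-file variant (the "upgrade physical location" option). This is a simplified model under assumptions, not a description of any real design: the `Renamer` class and its method names are hypothetical, one instruction is handled at a time, and every instruction has exactly one destination.

```python
from collections import deque


class Renamer:
    """Toy map table + free list for a unified physical register file."""

    def __init__(self, num_arch, num_phys):
        # Architectural register i initially lives in physical register i.
        self.map = list(range(num_arch))
        self.free = deque(range(num_arch, num_phys))
        self.ready = [True] * num_arch + [False] * (num_phys - num_arch)

    def dispatch(self, dest, srcs):
        """Source read + destination allocate. Returns None on a stall."""
        if not self.free:
            return None  # no physical register available: stall
        # 1. Source read: physical tag and whether its value is available.
        src_tags = [(self.map[s], self.ready[self.map[s]]) for s in srcs]
        # 2. Destination allocate: remember the old mapping for commit.
        new = self.free.popleft()
        old = self.map[dest]
        self.map[dest] = new
        self.ready[new] = False
        return {"dest": new, "old": old, "srcs": src_tags}

    def finish(self, entry):
        """3. Register update at finish: the physical register is written."""
        self.ready[entry["dest"]] = True

    def commit(self, entry):
        """At commit the old physical register can safely be recycled."""
        self.free.append(entry["old"])
```

With 4 architectural and 6 physical registers, two back-to-back writes to the same register use up the free list, and a third dispatch stalls until the first instruction commits and releases its old physical register.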
10 Renaming Operation
11 Renaming Buffer Options
1. Unified/merged register file (MIPS R10K, Alpha)
   - Registers change role between architectural and renamed
2. Rename register file (RRF) (PA 8500, PPC 620)
   - Holds new values until they are committed to the ARF
   - Extra data transfer
3. Renaming in the ROB (Pentium III)
Note: can have a single scheme or separate schemes for integer/FP
12 Unified Register File: Physical Register FSM
13 Register Renaming in the IBM RS6000 FPU
[Figure: FPU register renaming. Instruction fields OP, T, S1, S2, S3 pass through a 32 x 6 map table; a free list and a pending target return queue, each with head and tail pointers. Annotations: "R7 (actual last use)", "Fload R7 <= Mem[]", "Free when Fload R7 commits".]
Simplified FPU register model:
- Incoming FPU instructions pass through a renaming table prior to decode
- The 32 architectural registers are remapped to 40 physical registers
- Physical register names are used within the FPU
- Complex control logic maintains the active register mapping
14 Renaming Difficulties: Wide Instruction Issue
- Need many ports in RFs and mapping tables
- Instruction dependences during dispatching/issuing/committing
  - Must handle dependences across instructions
  - E.g. add R1 <- R2+R3; sub R6 <- R1+R5
- Implementation: use comparators, multiplexors, counters
  - Comparators: discover RAW dependences
  - Multiplexors: generate the right physical address (old or new allocation)
  - Counters: determine the number of physical registers allocated
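The comparator/multiplexor/counter roles can be illustrated with a sketch that renames a whole dispatch group at once. The function `rename_group` and its data layout are assumptions made for illustration; it also assumes every instruction in the group writes a register.

```python
def rename_group(group, map_table, free_list):
    """Rename a dispatch group in one step.

    group: list of (dest, [srcs]) in program order, using architectural
    register numbers. Returns a list of (phys_dest, [phys_srcs]), or
    None if the group must stall.
    """
    # Counter: how many physical registers does this group need?
    if len(free_list) < len(group):
        return None
    new_regs = free_list[:len(group)]

    renamed = []
    for i, (dest, srcs) in enumerate(group):
        phys_srcs = []
        for s in srcs:
            choice = map_table[s]          # default: old mapping
            for j in range(i):             # comparators against older dests
                if group[j][0] == s:
                    choice = new_regs[j]   # mux: youngest older match wins
            phys_srcs.append(choice)
        renamed.append((new_regs[i], phys_srcs))

    # Update the map table in program order and consume the free registers.
    for (dest, _), phys in zip(group, new_regs):
        map_table[dest] = phys
    del free_list[:len(group)]
    return renamed


# The slide's example: add R1 <- R2+R3; sub R6 <- R1+R5
example_group = [(1, [2, 3]), (6, [1, 5])]
```

For the slide's example, the second instruction's R1 source must come from the register just allocated to the first instruction, not from the map table, which still holds the stale mapping while the group is being processed.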
15 Renaming Difficulties: Mispredictions & Exceptions
If an exception/misprediction occurs, the register mapping must be precise.
- Separate RRF: consider all RRF entries free
- ROB renaming: consider all ROB entries free
- Unified RF: restore the precise mapping
  - Single map: traverse the ROB to undo mappings (history file approach)
    - ROB must remember the old mapping
  - Two maps: architectural and future register map
    - On exception, copy the architectural map into the future map
  - Checkpointing: keep regular checkpoints of the map, restore when needed
    - When do we make a checkpoint? On every instruction? On every branch?
    - What are the trade-offs?
We'll revisit this approach later on.
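The checkpointing option can be sketched as a stack of map-table snapshots. This is a minimal illustration with hypothetical names (`CheckpointedMap`, `checkpoint`, `restore`); it assumes a checkpoint is taken at each branch and it does not model returning squashed physical registers to the free list.

```python
class CheckpointedMap:
    """Map table with a bounded stack of snapshots."""

    def __init__(self, num_arch, max_checkpoints=4):
        self.map = list(range(num_arch))
        self.checkpoints = []
        self.max_checkpoints = max_checkpoints

    def checkpoint(self):
        """Snapshot the map (e.g. at a branch). Returns None if full."""
        if len(self.checkpoints) == self.max_checkpoints:
            return None  # no snapshot slot left: dispatch would stall
        self.checkpoints.append(list(self.map))
        return len(self.checkpoints) - 1

    def restore(self, index):
        """Roll back to a snapshot and drop it and all younger ones."""
        self.map = self.checkpoints[index]
        del self.checkpoints[index:]
```

The trade-off raised on the slide shows up directly: more snapshot slots mean more storage, while fewer slots mean dispatch stalls sooner when many branches are in flight.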
16 Dataflow Engine for Dynamic Execution
[Figure: dispatch buffer -> dispatch -> reservation stations -> functional units (Branch, Integer, Integer, Floating-Point, Load/Store) -> completion buffer (reorder buffer) -> complete; register file, rename registers, and register write-back alongside]
Dispatch:
- Read register or assign register tag
- Advance instructions to reservation stations
- Allocate reorder buffer entries
Reservation stations:
- Monitor register tag
- Receive data being forwarded
- Issue when all operands are ready
Forwarding: results go to the reservation stations and rename registers
Completion buffer (reorder buffer):
- Managed as a queue
- Maintains sequential order of all instructions in flight ("takeoff" = dispatching; "landing" = completion)
17 Historical Background
Dynamic or data-flow scheduling:
- Scheduling hardware allows an instruction to be executed as soon as its source operands are ready and an FU is available
- Assuming renaming, only limited by RAW and structural hazards
First proposal: Tomasulo's algorithm in the IBM 360/91 FPU (1967)
- 1 instruction per cycle, distributed implementation, imprecise exceptions
We will talk directly about modern implementations
- Read the original in the textbook
- Differences: renaming, precise exceptions, multiple instructions per cycle
18 Steps in Dynamic Execution (1)
FETCH (in-order, speculative)
- I-cache access, predictions, insert in a fetch buffer
DISPATCH (in-order, speculative)
- Read operands from the Register File (ARF) and/or Rename Register File (RRF)
  - RRF may return a ready value or a tag for a physical location
- Allocate a new RRF entry (rename destination register) for the destination
- Allocate a Reorder Buffer (ROB) entry
- Advance the instruction to the appropriate entry in the scheduling hardware
  - Typical name for centralized: issue queue or instruction window
  - Typical name for distributed: reservation stations
19 Steps in Dynamic Execution (2)
ISSUE & EXECUTE (out-of-order, speculative)
- Scheduler entry monitors the result bus for rename register tag(s)
  - Find out if a source operand becomes ready
- When all operands are ready, issue the instruction into a Functional Unit (FU) and deallocate the scheduler entry (wake-up & select)
  - Subject to structural hazards & priorities
- When execution finishes, broadcast the result to waiting scheduler entries and the RRF entry
COMMIT/RETIRE/GRADUATE (in-order, non-speculative)
- When ready to commit the result into in-order state (head of the ROB):
  - Update the architectural register from the RRF entry, deallocate the RRF entry, and, if it is a store instruction, advance it to the Store Buffer
  - Deallocate the ROB entry; the instruction is considered architecturally completed
  - Update predictors based on the instruction result
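The wake-up and select loop in the issue stage can be sketched as a small cycle-by-cycle simulation. The model below is an assumption-laden toy, not the lecture's design: `simulate_issue` is a hypothetical name, every instruction has a one-cycle latency, results are broadcast in time for dependents to issue in the next cycle, and selection is oldest-first up to `issue_width`.

```python
def simulate_issue(window, ready_tags, issue_width=2):
    """Return {instruction name: cycle in which it issues}.

    window: list of dicts with "name", "dest" (tag) and "srcs" (tags),
    in program order (oldest first).
    """
    ready = set(ready_tags)
    waiting = list(window)
    issue_cycle = {}
    cycle = 0
    while waiting:
        # Wake-up: entries whose source tags have all been broadcast.
        woken = [e for e in waiting if all(t in ready for t in e["srcs"])]
        # Select: oldest first, limited by the issue width.
        selected = woken[:issue_width]
        if not selected:
            raise ValueError("no waiting entry can ever become ready")
        for entry in selected:
            issue_cycle[entry["name"]] = cycle
            waiting.remove(entry)          # deallocate the scheduler entry
        # Broadcast destination tags to the remaining entries.
        ready.update(entry["dest"] for entry in selected)
        cycle += 1
    return issue_cycle


example_window = [
    {"name": "I1", "dest": "p1", "srcs": ["p10", "p11"]},
    {"name": "I2", "dest": "p2", "srcs": ["p1", "p12"]},
    {"name": "I3", "dest": "p3", "srcs": ["p13", "p14"]},
    {"name": "I4", "dest": "p4", "srcs": ["p2", "p3"]},
]
example_ready = ["p10", "p11", "p12", "p13", "p14"]
```

In this example I3 issues in cycle 0 alongside I1, ahead of the older I2, which has to wait for I1's result. That reordering is the out-of-order behaviour the slide describes.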
20 Centralized Instruction Window or Issue Queue Implementation
Each entry also carries info for executing the instruction (e.g. opcode, ROB entry, RRF entry)
21 Instruction Window Source Operand Options
- Option (a): read at dispatch and keep in the window
- Option (b): read at issue
22 ROB Implementation
23 Example: MIPS R10000 circa 1996
24 R10000 Design Choices
Register renaming:
- Map table lookup + dependency check on simultaneous dispatches
- Unified physical register file
- 4-deep branch stack to back up the map table on branch predictions
- Sequential (4-at-a-time) back-tracking to recover from exceptions
Instruction queues:
- Separate 16-entry floating-point and integer instruction queues
- Prioritized, dataflow-ordered scheduling
Reorder buffer:
- One entry per outstanding instruction, FIFO ordered
- Stores PC, logical destination number, old physical destination number
- Why not the current physical destination number?
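One way to think about the closing question is to sketch back-tracking with a reorder buffer that stores only the logical destination and the old physical destination. This is an illustrative model, not the R10000's actual logic; `dispatch` and `squash_youngest` are hypothetical names, and the walk undoes one entry at a time rather than four.

```python
def dispatch(map_table, free_list, rob, logical_dest):
    """Allocate a new physical register and log the old mapping."""
    new_phys = free_list.pop(0)
    rob.append((logical_dest, map_table[logical_dest]))
    map_table[logical_dest] = new_phys
    return new_phys


def squash_youngest(map_table, free_list, rob, keep):
    """Undo all but the oldest `keep` ROB entries, youngest first."""
    while len(rob) > keep:
        logical_dest, old_phys = rob.pop()
        # Walking youngest-first, the map table still holds the register
        # this instruction allocated, so it need not be stored in the ROB.
        free_list.append(map_table[logical_dest])
        map_table[logical_dest] = old_phys
```

In this sketch the current physical destination is always recoverable from the map table during the walk, while the old one is what must be restored on a squash (and what is freed at commit), which suggests why the old number is the one worth storing.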
25 R10000 Block Diagram
26 R10000 Instruction Fetch and Branch
27 R10000 Register Renaming
28 R10000 Pipelines
29 R10000 Integer Queue
30 Priority/Select Logic
Tree of arbiters that works in 2 phases.
First phase:
- Request signals are propagated up the tree; only ready instructions send requests
- Each request in turn raises the ready signal of its parent arbiter cell
- At the root cell, one or more of the input request signals will be high if one or more instructions are ready
- The root cell grants the functional unit to one of its children by raising one of its grant outputs
Second phase:
- The grant signal is propagated down the tree to the instruction that is selected
- The enable signal to the root cell is high whenever the functional unit is ready to execute an instruction
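The two phases can be sketched in software as an up-sweep that ORs request signals and a down-sweep that follows the grant. This is a simplified model with assumptions: `select` is a hypothetical function, the arbiter cells have two inputs each, and within a cell the lower-numbered input always wins (static, position-based priority).

```python
def select(requests, enable=True):
    """Return the index of the granted window entry, or None.

    requests: list of booleans, one per window entry; index 0 has the
    highest priority. enable: the functional unit can accept work.
    """
    if not requests:
        return None
    # Phase 1: propagate requests up the tree (OR of each pair).
    levels = [list(requests)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([any(prev[i:i + 2]) for i in range(0, len(prev), 2)])
    if not (enable and levels[-1][0]):
        return None  # FU busy, or no instruction is ready
    # Phase 2: propagate the grant down, preferring the left child.
    index = 0
    for level in reversed(levels[:-1]):
        left = 2 * index
        index = left if level[left] else left + 1
    return index
```

For eight entries with requests from entries 2, 4, and 5, the grant reaches entry 2. The depth of the tree, and so the delay of each phase, grows with the logarithm of the window size.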
31 Priority/Select Logic Issues
Selection is easier if the priority depends on instruction location
- Older instructions are at the bottom of the window and receive priority
This creates an issue of compacting/collapsing:
- As instructions depart, compress the remaining ones towards the bottom
- Younger instructions will be inserted towards the top (lower priority)
Compacting the window is not easy!
- Its complexity can affect performance (clock frequency)
- Often implemented in some restricted form
  - E.g. split the window into two parts, allow compaction from the 2nd half towards the 1st
- Trade-off between window utilization and compaction simplicity
32 Wake-up and Select Latency
Assume a result becomes available in cycle i
- When can you start executing an instruction that waits for it?
Ideal solution: in cycle i+1
- Back-to-back execution, just like with a 5-stage pipeline
- Requirement: the following have to work in one cycle
  - Distribute the result tag to the window & detect that the instruction becomes ready
  - Select the instruction for execution & forward its info/operands to the FU
- May stress the clock cycle in a wide processor
Alternative: split wake-up and select into separate cycles
- Simpler hardware, faster clock cycle
- Lower IPC (dependences cost one extra cycle)
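The IPC cost of splitting wake-up and select can be made concrete for a chain of dependent instructions. The formula below is a back-of-the-envelope sketch under stated assumptions: `chain_cycles` is a hypothetical helper, all instructions have the same execution latency, and nothing else (structural hazards, cache misses) delays the chain.

```python
def chain_cycles(chain_length, exec_latency=1, wakeup_select_cycles=1):
    """Cycles to execute a chain of dependent instructions.

    With wake-up and select in one cycle, a dependent instruction starts
    right after its producer finishes. Each extra scheduling cycle adds
    one bubble per dependence edge in the chain.
    """
    gap = exec_latency + (wakeup_select_cycles - 1)
    return gap * (chain_length - 1) + exec_latency
```

For a chain of 4 single-cycle instructions this gives 4 cycles with back-to-back scheduling and 7 cycles with split wake-up and select, so the faster clock has to make up for almost a 2x slowdown on serial code.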
33 Result Forwarding (Common Data Bus, CDB)
Common data bus: used to broadcast results of FUs
Broadcast destinations:
- RF or RRF or ROB, depending on the renaming scheme
- Instruction window
  - May need the result or the tag for the result
Number of CDBs:
- Best case, 1 per functional unit
- Can have fewer, but then we may have a structural hazard
Notes:
- CDBs can be slow, as they go across a large chip area
- Broadcast the tag early
34 Dynamic Scheduling Implementation Cost
To support N-way dispatch into the IW per cycle:
- Nx2 simultaneous lookups into the rename map (or associative search)
- N simultaneous write ports into the IW and the ROB
To support N-way issue per cycle (assuming read at issue):
- 1 prioritized associative lookup of N entries
- N read ports into the IW
- Nx2 read ports into the RF
To support N-way finish per cycle:
- N write ports into the RF and the ROB
- Nx2 associative lookup and write in the IW
To support N-way retire per cycle:
- N read ports in the ROB
- N ports into the RF (potentially)
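The counts on this slide scale linearly with N, which is easy to tabulate. The helper below simply restates the slide's figures; `scheduling_cost` and its dictionary keys are hypothetical names, and the factor of 2 assumes two source operands per instruction.

```python
def scheduling_cost(n, srcs_per_instr=2):
    """Port and lookup counts for an N-wide machine, per the slide."""
    return {
        # Dispatch
        "rename_map_lookups": n * srcs_per_instr,
        "iw_write_ports": n,
        "rob_write_ports_dispatch": n,
        # Issue (operands read at issue)
        "iw_read_ports": n,
        "rf_read_ports": n * srcs_per_instr,
        # Finish
        "rf_write_ports": n,
        "rob_write_ports_finish": n,
        "iw_associative_lookups": n * srcs_per_instr,
        # Retire
        "rob_read_ports": n,
        "rf_retire_ports": n,
    }


# Compare a few widths side by side.
cost_by_width = {n: scheduling_cost(n) for n in (2, 4, 8)}
```

The port counts grow linearly, but the area and delay of a multi-ported structure typically grow faster than the port count, which is why wide dispatch and issue are costly in practice.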
35 Instruction Window Alternatives
Single vs. multiple buffers (trade-offs?)
- Single centralized window
- Single centralized window with static alignment for different FUs
- Separate integer, FP, and LSU windows
- Separate buffers for each FU
  - Aka reservation stations (see Tomasulo's algorithm)
Management policies to keep in mind
- Random access or FIFO
  - In-order vs. out-of-order within each queue
- Age-prioritized or criticality-based
- Value vs. tag only
- When to deallocate
Reservation stations for Ld/St units are more complicated
36 MIPS R10000
CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution Prof. Yanjing Li University of Chicago Administrative Stuff! Lab2 due tomorrow " 2 free late days! Lab3 is out " Start early!! My office
More informationHardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.
Instruction-Level Parallelism and its Exploitation: PART 2 Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.8)
More informationEECS 470 Lecture 7. Branches: Address prediction and recovery (And interrupt recovery too.)
EECS 470 Lecture 7 Branches: Address prediction and recovery (And interrupt recovery too.) Warning: Crazy times coming Project handout and group formation today Help me to end class 12 minutes early P3
More informationCISC 662 Graduate Computer Architecture. Lecture 10 - ILP 3
CISC 662 Graduate Computer Architecture Lecture 10 - ILP 3 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationEEC 581 Computer Architecture. Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW)
1 EEC 581 Computer Architecture Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW) Chansu Yu Electrical and Computer Engineering Cleveland State University Overview
More informationTutorial 11. Final Exam Review
Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache
More informationTomasulo s Algorithm
Tomasulo s Algorithm Architecture to increase ILP Removes WAR and WAW dependencies during issue WAR and WAW Name Dependencies Artifact of using the same storage location (variable name) Can be avoided
More informationSuper Scalar. Kalyan Basu March 21,
Super Scalar Kalyan Basu basu@cse.uta.edu March 21, 2007 1 Super scalar Pipelines A pipeline that can complete more than 1 instruction per cycle is called a super scalar pipeline. We know how to build
More informationPortland State University ECE 587/687. Superscalar Issue Logic
Portland State University ECE 587/687 Superscalar Issue Logic Copyright by Alaa Alameldeen, Zeshan Chishti and Haitham Akkary 2017 Instruction Issue Logic (Sohi & Vajapeyam, 1987) After instructions are
More informationPIPELINING: HAZARDS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah
PIPELINING: HAZARDS Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Homework 1 submission deadline: Jan. 30 th This
More informationHardware-based Speculation
Hardware-based Speculation M. Sonza Reorda Politecnico di Torino Dipartimento di Automatica e Informatica 1 Introduction Hardware-based speculation is a technique for reducing the effects of control dependences
More informationThe basic structure of a MIPS floating-point unit
Tomasulo s scheme The algorithm based on the idea of reservation station The reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from
More informationLecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S
Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest
More informationCS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming
CS 152 Computer Architecture and Engineering Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming John Wawrzynek Electrical Engineering and Computer Sciences University of California at
More informationCPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor
1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic
More information15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 8: Issues in Out-of-order Execution Prof. Onur Mutlu Carnegie Mellon University Readings General introduction and basic concepts Smith and Sohi, The Microarchitecture
More informationEEC 581 Computer Architecture. Lec 7 Instruction Level Parallelism (2.6 Hardware-based Speculation and 2.7 Static Scheduling/VLIW)
EEC 581 Computer Architecture Lec 7 Instruction Level Parallelism (2.6 Hardware-based Speculation and 2.7 Static Scheduling/VLIW) Chansu Yu Electrical and Computer Engineering Cleveland State University
More informationEECS 470 Midterm Exam Winter 2009
EECS 70 Midterm Exam Winter 2009 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 18 2 / 12 3 / 29 / 21
More informationSuperscalar Processors Ch 14
Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion
More informationILP: Instruction Level Parallelism
ILP: Instruction Level Parallelism Tassadaq Hussain Riphah International University Barcelona Supercomputing Center Universitat Politècnica de Catalunya Introduction Introduction Pipelining become universal
More informationSuperscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?
Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion
More informationLecture: Out-of-order Processors
Lecture: Out-of-order Processors Topics: branch predictor wrap-up, a basic out-of-order processor with issue queue, register renaming, and reorder buffer 1 Amdahl s Law Architecture design is very bottleneck-driven
More informationILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)
Instruction-Level Parallelism and its Exploitation: PART 1 ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Project and Case
More informationCOSC4201 Instruction Level Parallelism Dynamic Scheduling
COSC4201 Instruction Level Parallelism Dynamic Scheduling Prof. Mokhtar Aboelaze Parts of these slides are taken from Notes by Prof. David Patterson (UCB) Outline Data dependence and hazards Exposing parallelism
More informationOutline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar, VLIW. CPE 631 Session 19 Exploiting ILP with SW Approaches
Session xploiting ILP with SW Approaches lectrical and Computer ngineering University of Alabama in Huntsville Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar,
More informationEECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)
Evolution of Processor Performance So far we examined static & dynamic techniques to improve the performance of single-issue (scalar) pipelined CPU designs including: static & dynamic scheduling, static
More informationLSU EE 4720 Dynamic Scheduling Study Guide Fall David M. Koppelman. 1.1 Introduction. 1.2 Summary of Dynamic Scheduling Method 3
PR 0,0 ID:incmb PR ID:St: C,X LSU EE 4720 Dynamic Scheduling Study Guide Fall 2005 1.1 Introduction David M. Koppelman The material on dynamic scheduling is not covered in detail in the text, which is
More informationLecture 19: Instruction Level Parallelism
Lecture 19: Instruction Level Parallelism Administrative: Homework #5 due Homework #6 handed out today Last Time: DRAM organization and implementation Today Static and Dynamic ILP Instruction windows Register
More information