EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1
Announcements
- Project proposal due on Wed 10/14
  - 2-3 pages, submitted through email
  - List the group members
  - Describe the topic, including why it is important and your thesis
  - Describe the methodology you will use (experiments, tools, machines)
  - Statement of expected results
  - A few key references to related work
Lecture 7-2
What Limits ILP
Instruction processing constraints:
- Resource contention (structural dependences)
- Code dependences
  - Control dependences
  - Data dependences
    - True dependences (RAW)
    - Storage conflicts
      - Anti-dependences (WAR)
      - Output dependences (WAW)
Lecture 7-3
The Reason for WAW and WAR: Register Recycling
- Compiler: code generation (single assignment, symbolic registers), then register allocation
  - Map symbolic registers to physical registers
  - Maximize reuse of registers
- Instruction loops
  - The same set of registers is reused in each iteration
  - Overlapped execution of different iterations then runs into WAR and WAW conflicts on the recycled registers

Source loop:
    for (k = 1; k <= 10; k++)
        t += a[i][k] * b[k][j];

Generated code:
    $34: mul  $14, $7, 40
         addu $15, $4, $14
         mul  $24, $9, 4
         addu $25, $15, $24
         lw   $11, 0($25)
         mul  $12, $9, 40
         addu $13, $5, $12
         mul  $14, $8, 4
         addu $15, $13, $14
         lw   $24, 0($15)
         mul  $25, $11, $24
         addu $10, $10, $25
         addu $9, $9, 1
         ble  $9, 10, $34
Lecture 7-4
Resolving False Dependences
Anti-dependence (WAR):
    (1) R4 <- R3 + 1
    (2) R3 <- R5 + 1
  Must prevent (2) from completing before (1) is dispatched
Output dependence (WAW):
    (1) R3 <- R3 + R5
    (2) R3 <- R5 + 1
  Must prevent (2) from completing before (1) completes
- Stalling: delay dispatching (or write back) of the later instruction
- Copy operands: copy a not-yet-used operand to prevent it from being overwritten (WAR)
- Register renaming: use a different register (WAW & WAR)
Lecture 7-5
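The two instruction pairs above can be checked mechanically. The following is a minimal sketch, assuming each instruction has one destination and a tuple of source registers; the names `Instr` and `dependences` are illustrative and not from the lecture.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read


def dependences(first: Instr, second: Instr) -> set:
    """Return the dependences of `second` on the earlier instruction `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")    # true dependence: second reads what first writes
    if second.dest in first.srcs:
        found.add("WAR")    # anti-dependence: second overwrites a source of first
    if second.dest == first.dest:
        found.add("WAW")    # output dependence: both write the same register
    return found


# (1) R4 <- R3 + 1 ; (2) R3 <- R5 + 1
war_pair = dependences(Instr("R4", ("R3",)), Instr("R3", ("R5",)))
# (1) R3 <- R3 + R5 ; (2) R3 <- R5 + 1
waw_pair = dependences(Instr("R3", ("R3", "R5")), Instr("R3", ("R5",)))
print(sorted(war_pair), sorted(waw_pair))
```

Note that the second pair is reported as both WAR and WAW, because instruction (1) also reads R3 before (2) overwrites it.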
Register Renaming: The Idea
- Anti and output dependences are false dependences
    r3 <- r1 op r2
    r5 <- r3 op r4
    r3 <- r6 op r7
- The dependence is on the name/location rather than the data
- Given an unlimited number of registers, anti and output dependences can always be eliminated

Original:
    r1 <- r2 / r3
    r4 <- r1 * r5
    r1 <- r3 + r6
    r3 <- r1 - r4
Renamed:
    r1 <- r2 / r3
    r4 <- r1 * r5
    r8 <- r3 + r6
    r9 <- r8 - r4
Lecture 7-6
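The renaming idea can be sketched in a few lines of Python. This sketch assumes an unlimited pool of physical registers named p0, p1, ... (hypothetical names) and, unlike the slide, gives every destination a fresh register rather than only the conflicting ones. The function name `rename` and the tuple layout are illustrative.

```python
def rename(program, num_arch_regs=8):
    """Rename (dest, op, src1, src2) tuples so that every write gets a fresh register."""
    # Map table: architectural name -> current physical name.
    map_table = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    next_free = num_arch_regs
    renamed = []
    for dest, op, src1, src2 in program:
        # Read sources through the map first, so RAW dependences are preserved.
        s1, s2 = map_table[src1], map_table[src2]
        # Allocate a fresh destination, which removes WAR and WAW on `dest`.
        new_dest = f"p{next_free}"
        next_free += 1
        map_table[dest] = new_dest
        renamed.append((new_dest, op, s1, s2))
    return renamed


original = [
    ("r1", "/", "r2", "r3"),
    ("r4", "*", "r1", "r5"),
    ("r1", "+", "r3", "r6"),
    ("r3", "-", "r1", "r4"),
]
for dest, op, s1, s2 in rename(original):
    print(f"{dest} <- {s1} {op} {s2}")
```

After renaming, no two instructions write the same register, and the last instruction still reads the result of the third, so only the true dependences remain.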
Register Renaming Technique
- Register renaming resolves:
  - Anti-dependences
  - Output dependences
- Design of redundant registers:
  - Number: one or multiple
  - Allocation: fixed for each register, or pooled for all registers
  - Location: attached to the register file (centralized), or attached to functional units (distributed)
[Figure: architected registers R1, R2, ..., Rn mapped onto physical registers P1, P2, ..., Pn, ..., Pn+k]
Lecture 7-7
Integrating Map Tables with the ARF Lecture 7-8
Register Renaming Operations
At Decode/Dispatch: for each instruction handled in parallel
  1. Source Read: check availability of source operands
  2. Destination Allocate: map destination register to a new physical register
     - Stall if no register is available
  Note: must have enough ports to any map tables
At Finish:
  3. Register Update: update physical register
At Complete/Commit: for each instruction handled in parallel
  3. Register Update: update architectural register
     - Copy from RRF/ROB to ARF and deallocate the RRF entry; OR
     - Upgrade the physical location and deallocate the register holding the old value
     - It is now safe to do that
Question: can we allocate later or deallocate earlier?
Lecture 7-9
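The three operations can be sketched for the "upgrade physical location" variant, where a single pool of physical registers is used and the register holding the old value is freed at commit. This is a simplified model under stated assumptions: one instruction at a time, registers identified by integer index, and hypothetical names `Renamer`, `dispatch`, `finish`, and `commit`.

```python
from collections import deque


class Renamer:
    def __init__(self, num_arch=4, num_phys=6):
        # Architectural register r<i> starts out mapped to physical register i.
        self.map = {f"r{i}": i for i in range(num_arch)}
        self.free = deque(range(num_arch, num_phys))
        self.ready = [True] * num_arch + [False] * (num_phys - num_arch)

    def dispatch(self, dest, srcs):
        """Source read, then destination allocate. Returns None on a stall."""
        if not self.free:
            return None  # no physical register available
        # 1. Source read: physical tag plus whether its value is available yet.
        src_tags = [(self.map[s], self.ready[self.map[s]]) for s in srcs]
        # 2. Destination allocate: remember the old mapping for commit time.
        new = self.free.popleft()
        old = self.map[dest]
        self.map[dest] = new
        self.ready[new] = False
        return {"srcs": src_tags, "dest": new, "old": old}

    def finish(self, entry):
        """3. Register update at finish: the physical register now holds a value."""
        self.ready[entry["dest"]] = True

    def commit(self, entry):
        """3. Register update at commit: the old value can no longer be needed."""
        self.free.append(entry["old"])


r = Renamer()
a = r.dispatch("r1", ["r2", "r3"])   # gets a fresh register for r1
b = r.dispatch("r1", ["r1"])         # reads a's tag, which is not ready yet
stalled = r.dispatch("r2", ["r1"])   # free list is empty: stall
r.finish(a)
r.commit(a)                          # frees the register r1 used before a
c = r.dispatch("r2", ["r1"])         # now succeeds
```

Reading the sources before allocating the destination matters for instructions such as `r1 <- r1 + ...`, which must see the previous mapping of r1.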
Renaming Operation Lecture 7-10
Renaming Buffer Options
1. Unified/merged register file
   - MIPS R10K, Alpha 21264
   - Registers change role between architected and renamed
2. Rename register file (RRF)
   - PA 8500, PPC 620
   - Holds new values until they are committed to the ARF
   - Extra data transfer
3. Renaming in the ROB
   - Pentium III
Note: can have a single scheme, or separate schemes for integer/FP
Lecture 7-11
Unified Register File: Physical Register FSM Lecture 7-12
Register Renaming in the IBM RS6000 FPU
- Incoming FPU instructions pass through a renaming table prior to decode
- The 32 architectural registers are remapped to 40 physical registers
- Physical register names are used within the FPU
- Complex control logic maintains the active register mapping
[Figure: FPU register renaming, simplified FPU register model. Instruction fields OP, T, S1, S2, S3; map table (32 x 6); free list (head/tail) holding physical registers 32-39; pending target return queue (head/release/tail). Labels in the figure: "R7: R32", "Fload R7 <= Mem[]", "head <= R7 (actual last use)", "Free when Fload R7 commits".]
Lecture 7-13
Renaming Difficulties: Wide Instruction Issue
- Need many ports in RFs and mapping tables
- Instruction dependences during dispatching/issuing/committing
  - Must handle dependences across instructions
  - E.g. add R1 <- R2 + R3; sub R6 <- R1 + R5
- Implementation: use comparators, multiplexors, counters
  - Comparators: discover RAW dependences
  - Multiplexors: generate the right physical address (old or new allocation)
  - Counters: determine the number of physical registers allocated
Lecture 7-14
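A software sketch of renaming a group of instructions in the same cycle makes the role of the comparators and multiplexors concrete. It assumes two sources per instruction and physical register names such as P7 (hypothetical); the function name `rename_group` is illustrative, and real hardware does all of this in parallel rather than in loops.

```python
def rename_group(group, map_table, free_list):
    """Rename a list of (dest, src1, src2) dispatched together.

    Returns a list of (phys_dest, phys_src1, phys_src2), or None on a stall.
    """
    if len(free_list) < len(group):
        return None                       # counter: not enough physical registers
    old_map = dict(map_table)             # every lookup sees the pre-group map
    new_regs = free_list[:len(group)]     # one allocation per instruction
    renamed = []
    for i, (dest, *srcs) in enumerate(group):
        phys_srcs = []
        for s in srcs:
            tag = old_map[s]              # map table read port (old allocation)
            for j in range(i):            # comparators against earlier destinations
                if group[j][0] == s:
                    tag = new_regs[j]     # mux picks the newest allocation
            phys_srcs.append(tag)
        renamed.append((new_regs[i], *phys_srcs))
    for i, (dest, *_srcs) in enumerate(group):
        map_table[dest] = new_regs[i]     # a later writer of the same dest wins
    del free_list[:len(group)]
    return renamed


map_table = {"R1": "P1", "R2": "P2", "R3": "P3", "R5": "P5", "R6": "P6"}
free_list = ["P7", "P8", "P9"]
# add R1 <- R2 + R3 ; sub R6 <- R1 + R5
print(rename_group([("R1", "R2", "R3"), ("R6", "R1", "R5")], map_table, free_list))
```

The `sub` reads R1 from the allocation made for `add` in the same group, not from the stale map table entry.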
Renaming Difficulties: Mispredictions & Exceptions
- If an exception/misprediction occurs, the register mapping must be precise
  - Separate RRF: consider all RRF entries free
  - ROB renaming: consider all ROB entries free
  - Unified RF: restore the precise mapping
    - Single map: traverse the ROB to undo mappings (history file approach)
      - The ROB must remember the old mapping
    - Two maps: architectural and future register map
      - On an exception, copy the architectural map into the future map
    - Checkpointing: keep regular checkpoints of the map, restore when needed
      - When do we make a checkpoint? On every instruction? On every branch?
      - What are the trade-offs?
- We'll revisit this approach later on
Lecture 7-15
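The single-map (history file) recovery can be sketched as a walk over the ROB from the youngest entry back to the mispredicted branch. This sketch assumes each ROB entry is a tuple `(arch_dest, new_phys, old_phys)`, with `None` fields for instructions that write no register; the layout and the name `recover` are illustrative.

```python
def recover(rob, branch_index, map_table, free_list):
    """Squash all ROB entries younger than rob[branch_index] and undo their mappings."""
    while len(rob) - 1 > branch_index:
        arch, new_phys, old_phys = rob.pop()   # youngest entry first
        if arch is None:
            continue                           # no destination, nothing to undo
        map_table[arch] = old_phys             # restore the previous mapping
        free_list.append(new_phys)             # the squashed allocation is free again


map_table = {"r1": "p5", "r2": "p4"}
free_list = []
rob = [
    ("r1", "p3", "p1"),    # older than the branch: keep
    (None, None, None),    # the mispredicted branch
    ("r2", "p4", "p2"),    # wrong path
    ("r1", "p5", "p3"),    # wrong path
]
recover(rob, 1, map_table, free_list)
print(map_table, free_list)
```

Undoing in youngest-first order is what makes repeated writes to the same architectural register unwind correctly, at the cost of a recovery time proportional to the number of squashed instructions. A checkpoint trades that time for the storage of a full map copy.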
Dataflow Engine for Dynamic Execution
[Figure: Dispatch Buffer, Dispatch (with Reg. File and Ren. Reg.), Reservation Stations, functional units (Branch, Integer, Integer, Floating-Point, Load/Store), Completion Buffer (Reorder Buffer), Complete, Reg. Write Back]
- Dispatch
  - Read register or assign register tag
  - Advance instructions to reservation stations
  - Allocate Reorder Buffer entries
- Reservation stations
  - Monitor register tags
  - Receive data being forwarded
  - Issue when all operands are ready
- Functional units forward results to the reservation stations and rename registers
- Completion Buffer (Reorder Buffer)
  - Managed as a queue
  - Maintains the sequential order of all instructions in flight ("takeoff" = dispatching; "landing" = completion)
Lecture 7-16
Historical Background
- Dynamic or data-flow scheduling: scheduling hardware allows an instruction to execute as soon as its source operands are ready and an FU is available
  - Assuming renaming, only limited by RAW and structural hazards
- First proposal: Tomasulo's algorithm in the IBM 360/91 FPU (1967)
  - 1 instruction per cycle, distributed implementation, imprecise exceptions
- We will talk directly about modern implementations
  - Read the original in the textbook
  - Differences: renaming, precise exceptions, multiple instructions per cycle, ...
Lecture 7-17
Steps in Dynamic Execution (1)
- FETCH instruction (in-order, speculative)
  - I-cache access, predictions, insert in a fetch buffer
- DISPATCH (in-order, speculative)
  - Read operands from the Register File (ARF) and/or Rename Register File (RRF)
    - The RRF may return a ready value or a tag for a physical location
  - Allocate a new RRF entry (rename destination register) for the destination
  - Allocate a Reorder Buffer (ROB) entry
  - Advance the instruction to the appropriate entry in the scheduling hardware
    - Typical name for centralized: issue queue or instruction window
    - Typical name for distributed: reservation stations
Lecture 7-18
Steps in Dynamic Execution (2)
- ISSUE & EXECUTE (out-of-order, speculative)
  - Scheduler entry monitors the result bus for rename register tag(s)
    - Find out if a source operand becomes ready
  - When all operands are ready, issue the instruction into a Functional Unit (FU) and deallocate the scheduler entry (wake-up & select)
    - Subject to structural hazards & priorities
  - When execution finishes, broadcast the result to waiting scheduler entries and the RRF entry
- COMMIT/RETIRE/GRADUATE (in-order, non-speculative)
  - When ready to commit the result into in-order state (head of the ROB):
    - Update the architectural register from the RRF entry, deallocate the RRF entry, and if it is a store instruction, advance it to the Store Buffer
    - Deallocate the ROB entry; the instruction is considered architecturally completed
    - Update predictors based on the instruction result
Lecture 7-19
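The wake-up and select step can be illustrated with a toy issue loop. This is a simplified model: every operation takes one cycle, results are broadcast at the end of the cycle in which an instruction issues, and selection is oldest-first. The name `schedule` and the dictionary fields are illustrative.

```python
def schedule(window, ready_tags, issue_width=1):
    """Return [(cycle, name), ...] for entries listed oldest first."""
    ready = set(ready_tags)
    waiting = list(window)
    trace = []
    cycle = 0
    while waiting:
        # Wake-up: an entry is ready when all of its source tags have been broadcast.
        awake = [e for e in waiting if all(s in ready for s in e["srcs"])]
        # Select: oldest-first, limited by the number of functional units.
        chosen = awake[:issue_width]
        if not chosen:
            raise ValueError("a source tag is never produced")
        for e in chosen:
            waiting.remove(e)             # deallocate the scheduler entry
            trace.append((cycle, e["name"]))
        # Broadcast the destination tags to the remaining entries.
        ready.update(e["dest"] for e in chosen)
        cycle += 1
    return trace


window = [
    {"name": "a", "dest": "t1", "srcs": ["t0"]},
    {"name": "b", "dest": "t2", "srcs": ["t1"]},   # depends on a
    {"name": "c", "dest": "t3", "srcs": ["t0"]},   # independent of a and b
]
print(schedule(window, {"t0"}, issue_width=2))
```

With two functional units, `c` issues in the same cycle as `a` and ahead of the older `b`, which is the out-of-order behavior the slide describes.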
Centralized Instruction Window or Issue Queue Implementation + info for executing instruction (e.g. opcode, ROB entry, RRF entry) Lecture 7-20
Instruction Window Source Operand Options Option (a): read at dispatch and keep in the window Option (b): read at issue Lecture 7-21
ROB Implementation Lecture 7-22
Example: MIPS R10000 circa 1996 Lecture 7-23
R10000 Design Choices
- Register renaming
  - Map table lookup + dependency check on simultaneous dispatches
  - Unified physical register file
  - 4-deep branch stack to back up the map table on branch predictions
  - Sequential (4-at-a-time) back-tracking to recover from exceptions
- Instruction queues
  - Separate 16-entry floating point and integer instruction queues
  - Prioritized, dataflow-ordered scheduling
- Reorder buffer
  - One entry per outstanding instruction, FIFO ordered
  - Stores PC, logical destination number, old physical destination number
  - Why not the current physical destination number?
Lecture 7-24
R10000 Block Diagram Lecture 7-25
R10000 Instruction Fetch and Branch Lecture 7-26
R10000 Register Renaming Lecture 7-27
R10000 Pipelines Lecture 7-28
R10000 Integer Queue Lecture 7-29
Priority/Select Logic
- Tree of arbiters that works in 2 phases
- First phase
  - Request signals are propagated up the tree; only ready instructions send requests
  - A request in turn raises the request signal of its parent arbiter cell
  - At the root cell, one or more of the input request signals will be high if one or more instructions are ready
  - The root cell grants the functional unit to one of its children by raising one of its grant outputs
- Second phase
  - The grant signal is propagated down the tree to the instruction that is selected
- The enable signal to the root cell is high whenever the functional unit is ready to execute an instruction
Lecture 7-30
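The two phases can be sketched in software. This sketch assumes a binary tree (the lecture does not state the arity of each arbiter cell), a window size that is a power of two, and a fixed priority for the lower-index child; the name `select` is illustrative.

```python
def select(requests, enable=True):
    """Return the index of the granted window entry, or None if nothing is granted."""
    # Phase 1: propagate request signals up the tree, level by level.
    levels = [list(requests)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] or prev[i + 1] for i in range(0, len(prev), 2)])
    # The root only grants if the functional unit is free (enable) and someone asked.
    if not (enable and levels[-1][0]):
        return None
    # Phase 2: propagate the grant down, preferring the lower-index child.
    idx = 0
    for level in reversed(levels[:-1]):
        left = 2 * idx
        idx = left if level[left] else left + 1
    return idx


print(select([False, True, False, True]))
print(select([False, False, False, True]))
print(select([True, True, True, True], enable=False))
```

Because the grant always prefers the lower-index child, priority depends only on an entry's position in the window.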
Priority/Select Logic Issues
- Selection is easier if the priority depends on instruction location
  - Older instructions are at the bottom of the window and receive priority
- This creates an issue of compacting/collapsing:
  - As instructions depart, compress the remaining ones towards the bottom
  - Younger instructions will be inserted towards the top (lower priority)
- Compacting the window is not easy!
  - Its complexity can affect performance (clock frequency)
  - Often implemented in some restricted form
    - E.g. split the window into two parts, allow compaction from the 2nd half towards the 1st
  - Trade-off between window utilization and compaction simplicity
Lecture 7-31
Wake-up and Select Latency
- Assume a result becomes available in cycle i
  - When can you start executing an instruction that waits for it?
- Ideal solution: in cycle i+1
  - Back-to-back execution, just like with the 5-stage pipeline
  - Requirement: the following have to work in one cycle
    - Distribute the result tag to the window & detect that the instruction becomes ready
    - Select the instruction for execution & forward its info/operands to the FU
  - May stress the clock cycle in a wide processor
- Alternative: split wake-up and select into separate cycles
  - Simpler hardware, faster clock cycle
  - Lower IPC (dependences cost one extra cycle)
Lecture 7-32
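The IPC cost of splitting wake-up and select can be worked out for a chain of dependent instructions. The following is a simplified model, assuming single-cycle execution and no other stalls; the function name `issue_cycles` and its parameters are illustrative.

```python
def issue_cycles(chain_length, wakeup_select_cycles=1, first_cycle=0):
    """Issue cycle of each instruction in a chain where each depends on the previous one."""
    return [first_cycle + k * wakeup_select_cycles for k in range(chain_length)]


back_to_back = issue_cycles(4, wakeup_select_cycles=1)  # wake-up + select in one cycle
split = issue_cycles(4, wakeup_select_cycles=2)         # one extra cycle per dependence
print(back_to_back, split)
```

In this model a chain of four dependent instructions issues over cycles 0 to 3 with a single-cycle loop and over cycles 0 to 6 with a split loop, so the split design only wins if the faster clock outweighs the lost cycles on dependent code.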
Result Forwarding (Common Data Bus, CDB)
- Common data bus: used to broadcast the results of FUs
- Broadcast destinations
  - RF or RRF or ROB, depending on the renaming scheme
  - Instruction window
    - May need the result or the tag for the result
- Number of CDBs
  - Best case, 1 per functional unit
  - Can have fewer, but then we may have a structural hazard
- Notes:
  - CDBs can be slow as they go across a large chip area
  - Broadcast the tag early
Lecture 7-33
Dynamic Scheduling Implementation Cost
- To support N-way dispatch into the IW per cycle
  - Nx2 simultaneous lookups into the rename map (or associative search)
  - N simultaneous write ports into the IW and the ROB
- To support N-way issue per cycle (assuming read at issue)
  - 1 prioritized associative lookup of N entries
  - N read ports into the IW
  - Nx2 read ports into the RF
- To support N-way finish per cycle
  - N write ports into the RF and the ROB
  - Nx2 associative lookup and write in the IW
- To support N-way retire per cycle
  - N read ports in the ROB
  - N ports into the RF (potentially)
Lecture 7-34
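The counts above scale linearly with the machine width, which a small tally makes explicit. This sketch simply transcribes the slide, assuming two source operands per instruction; the function name `scheduling_cost` and the dictionary keys are illustrative.

```python
def scheduling_cost(n, srcs_per_instr=2):
    """Per-cycle structures needed for an N-way machine, following the slide's counts."""
    return {
        "dispatch": {
            "rename_map_lookups": n * srcs_per_instr,
            "iw_write_ports": n,
            "rob_write_ports": n,
        },
        "issue": {  # assuming operands are read at issue
            "prioritized_lookups": 1,
            "iw_read_ports": n,
            "rf_read_ports": n * srcs_per_instr,
        },
        "finish": {
            "rf_write_ports": n,
            "rob_write_ports": n,
            "iw_associative_lookups": n * srcs_per_instr,
        },
        "retire": {
            "rob_read_ports": n,
            "rf_ports": n,  # potentially, depending on the renaming scheme
        },
    }


cost = scheduling_cost(4)
print(cost["issue"]["rf_read_ports"], cost["finish"]["rf_write_ports"])
```

For a 4-way machine this gives 8 register file read ports and 4 write ports for issue and finish alone, before counting any ports needed at retire.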
Instruction Window Alternatives
- Single vs. multiple buffers (trade-offs?)
  - Single centralized window
  - Single centralized window with static alignment for different FUs
  - Separate integer, FP, and LSU windows
  - Separate buffers for each FU
    - Aka reservation stations (see Tomasulo's algorithm)
- Management policies to keep in mind
  - Random access or FIFO
  - In-order vs. out-of-order within each queue
  - Age-prioritized or criticality-based
  - Value vs. tag only
  - When to deallocate
- Reservation stations for Ld/St units are more complicated
Lecture 7-35
MIPS R10000 Lecture 7-36