Instruction-Level Parallelism and its Exploitation: Part 2
- Hardware-based speculation (2.6)
- Multiple issue plus static scheduling = VLIW (2.7)
- Multiple issue, dynamic scheduling, and speculation (2.8)
- Advanced techniques (2.9)

Lectures
1. Introduction
2. Instruction-Level Parallelism, part 1
3. Instruction-Level Parallelism, part 2
4. Memory Hierarchies
5. Multiprocessors and Thread-Level Parallelism
6. System Aspects and Virtualization
7. Summary and Review

Better performance in pipeline

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + 10.0;

    loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          SUBI R1, R1, #8
          BNEZ R1, loop
          NOP

- Basic pipeline with many stalls
- Dynamic scheduling, prediction, speculation, multiple issue, etc.

General Processor Organization
[Figure: processor organization with the blocks Fetch instruction, Get operands & Issue, Memory access, Integer & Logic, Floating point, and Update state]
Major bottlenecks
- Control hazards, memory performance => fetch bottleneck
- Data hazards, structural hazards, control hazards => issue bottleneck

Fetch Bottleneck
- Control hazards
  - Dynamic branch prediction: predict outcome of branches and jumps
  - Branch target buffers
- Memory bottleneck
  - Memory performance improvement (memory hierarchy, prefetch, ...)

Issue Bottleneck
- RAW hazards
  - Dynamic scheduling (out-of-order execution)
- WAR & WAW hazards
  - Remove name dependencies (register renaming)
- Structural hazards
  - Dynamic scheduling (out-of-order execution)
  - Memory performance improvement (memory hierarchy, prefetch, non-blocking, load/store queues)
  - Multiple and pipelined functional units
- Control hazards
  - Speculative execution
- Single issue
  - Issue multiple instructions per cycle (superscalar, VLIW)

Dynamic Instruction Scheduling (Ch. 2.4)
Key idea: allow subsequent independent instructions to proceed

    DIVD F0, F2, F4      ; takes a long time
    ADDD F10, F0, F8     ; stalls waiting for F0
    SUBD F12, F8, F13    ; let this instruction bypass the ADDD

[Figure: pipeline IF ID EX M WB, marking the stage where the instruction gets stuck]
- Enables out-of-order execution => out-of-order completion
- Two historical schemes used in recent machines:
  - Scoreboard: dates back to the CDC 6600 in 1963
  - Tomasulo's algorithm: IBM 360/91 in 1967

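The benefit of letting SUBD bypass the stalled ADDD can be sketched with a toy timing model. The latencies (DIVD 10 cycles, ADDD and SUBD 2 cycles), the one-issue-per-cycle rule, and the function name `start_cycles` are illustrative assumptions, not values from the lecture; only RAW dependencies are modeled.

```python
# Toy model (assumed latencies): when can each instruction start executing?
LATENCY = {"DIVD": 10, "ADDD": 2, "SUBD": 2}

# (operation, destination register, source registers)
PROGRAM = [
    ("DIVD", "F0", ("F2", "F4")),
    ("ADDD", "F10", ("F0", "F8")),
    ("SUBD", "F12", ("F8", "F13")),
]

def start_cycles(program, in_order):
    ready_at = {}   # register -> cycle at which its new value is available
    starts = []
    for i, (op, dest, srcs) in enumerate(program):
        start = i + 1                       # one instruction issued per cycle
        for reg in srcs:                    # RAW: wait for producers
            start = max(start, ready_at.get(reg, 0))
        if in_order and starts:             # in-order: cannot pass the previous instr.
            start = max(start, starts[-1] + 1)
        starts.append(start)
        ready_at[dest] = start + LATENCY[op]
    return starts

print(start_cycles(PROGRAM, in_order=True))    # SUBD is held behind ADDD
print(start_cycles(PROGRAM, in_order=False))   # SUBD starts right after issue
```

In this model the ADDD starts at the same cycle either way, because it truly depends on the DIVD; only the independent SUBD gains from out-of-order execution.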
Tomasulo's Algorithm

Basic Ideas
- Decouple issue from operand fetch
  - Prevents stalls due to RAW hazards
- Register renaming: translate register references to instruction (functional unit) references
  - Prevents WAR and WAW hazards

Three Stages of Tomasulo's Algorithm
1. Issue: get instruction from FP Op Queue
   - Issue if no structural hazard for a reservation station
2. Execution: operate on operands (EX)
   - Execute when both operands are available; if not ready, watch the Common Data Bus (CDB) for the result
3. Write result: finish execution (WB)
   - Write on CDB to all awaiting functional units; mark reservation station available
Normal bus: data + destination
Common Data Bus: data + source

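The "data + source" nature of the CDB can be illustrated with a minimal sketch of a reservation station that snoops broadcasts. The class `ReservationStation`, its field names (borrowed from the Vj/Vk/Qj/Qk columns used later in these slides), and the helper `broadcast` are hypothetical names for illustration; the operand values are made up.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str
    op: str
    vj: Optional[float] = None   # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # tag of the producer still computing the operand
    qk: Optional[str] = None

    def ready(self):
        # Execute only when no operand is still pending.
        return self.qj is None and self.qk is None

    def snoop(self, source, value):
        # A CDB broadcast carries the source tag plus the data.
        if self.qj == source:
            self.vj, self.qj = value, None
        if self.qk == source:
            self.vk, self.qk = value, None

def broadcast(stations, source, value):
    """Write result: send (source, value) to every waiting station."""
    for rs in stations:
        rs.snoop(source, value)

# ADDD F10, F0, F8 waits for F0, which the DIVD in station Mult1 will produce.
add1 = ReservationStation("Add1", "ADDD", vk=8.0, qj="Mult1")
assert not add1.ready()
broadcast([add1], "Mult1", 0.5)   # DIVD finishes and broadcasts its result
print(add1.ready(), add1.vj)
```

Because the waiting station matches on the producer's tag rather than on a register name, a later write to F0 by another instruction cannot be confused with the value this ADDD needs.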
Tomasulo and Dynamic Branch Prediction
- Tomasulo's algorithm assumes an instruction is completed when its result is written
- Dynamic branch prediction allows instructions to be speculatively issued after a branch until the branch has executed
- With dynamic scheduling according to Tomasulo's algorithm, instructions following a predicted branch must not execute and write their results until the prediction is verified!

Hardware-Based Speculative Execution
- Tomasulo's algorithm provides speculative issue, but not speculative execution
  - This may be a serious bottleneck, especially for programs with a high branch frequency
- Speculative execution requires:
  - Separating execution from commit
  - Keeping track of temporary results
  - Committing instructions in program order

HW Support for Speculation
- Need a reorder buffer (ROB) for uncommitted instructions
- The ROB can be an operand source
- Once an operation commits, the register file is updated
- Use the ROB entry number instead of the reservation station as the result tag
- Instructions commit in order
- Flush the ROB when a branch is mispredicted
- Store buffers are integrated into the ROB

Four Steps of a Speculative Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   - If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number for the destination
2. Execution: operate on operands (EX)
   - If both operands are ready, execute; if not, watch the CDB for the result; when both operands are in the reservation station, execute
3. Write result: finish execution
   - Write on the Common Data Bus to all awaiting FUs and the reorder buffer; mark reservation station available
4. Commit: update register with reorder result
   - When the instruction is at the head of the reorder buffer and the result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer

Example: ROB, register status, and reservation stations (L.D F2 in write result)

ROB entry  Busy  Instruction          State         Dest/Addr  Value
#1         no    L.D   F6, 32(R2)     Committed     F6         M[32+Reg[R2]]
#2         yes   L.D   F2, 44(R3)     Write result  F2         M[44+Reg[R3]]
#3         yes   MUL.D F0, F2, F4     Execute       F0
#4         yes   SUB.D F8, F2, F6     Ready         F8         #2 - #1
#5         yes   DIV.D F10, F0, F6    Execute       F10
#6         yes   ADD.D F6, F8, F2     Ready         F6         #4 + #2

Reg    F0   F1   F2   F3   F4   F5   F6   F7   F8   F10
ROB#   #3        #2                  #6        #4   #5
Busy   yes       yes                 yes       yes  yes

Station  Busy  Op     Vj             Vk             Qj  Qk  Dest  A
Load1    no
Load2    yes   L.D    44             Reg[R3]                #2    44+Reg[R3]
Add1     no
Add2     no
Add3     no
Mult1    yes   MUL.D                 Reg[F4]        #2      #3
Mult2    yes   DIV.D                 M[32+Reg[R2]]  #3      #5

Example: ROB, register status, and reservation stations (L.D F2 committed)

ROB entry  Busy  Instruction          State         Dest/Addr  Value
#1         no    L.D   F6, 32(R2)     Committed     F6         M[32+Reg[R2]]
#2         no    L.D   F2, 44(R3)     Committed     F2         M[44+Reg[R3]]
#3         yes   MUL.D F0, F2, F4     Execute       F0
#4         yes   SUB.D F8, F2, F6     Ready         F8         #2 - #1
#5         yes   DIV.D F10, F0, F6    Execute       F10
#6         yes   ADD.D F6, F8, F2     Ready         F6         #4 + #2

Reg    F0   F1   F2   F3   F4   F5   F6   F7   F8   F10
ROB#   #3                            #6        #4   #5
Busy   yes                           yes       yes  yes

Station  Busy  Op     Vj             Vk             Qj  Qk  Dest  A
Load1    no
Load2    no
Add1     no
Add2     no
Add3     no
Mult1    yes   MUL.D  M[44+Reg[R3]]  Reg[F4]                #3
Mult2    yes   DIV.D                 M[32+Reg[R2]]  #3      #5

Example: ROB, register status, and reservation stations (MUL.D in write result)

ROB entry  Busy  Instruction          State         Dest/Addr  Value
#1         no    L.D   F6, 32(R2)     Committed     F6         M[32+Reg[R2]]
#2         no    L.D   F2, 44(R3)     Committed     F2         M[44+Reg[R3]]
#3         yes   MUL.D F0, F2, F4     Write result  F0         #2 x Reg[F4]
#4         yes   SUB.D F8, F2, F6     Ready         F8         #2 - #1
#5         yes   DIV.D F10, F0, F6    Execute       F10
#6         yes   ADD.D F6, F8, F2     Ready         F6         #4 + #2

Reg    F0   F1   F2   F3   F4   F5   F6   F7   F8   F10
ROB#   #3                            #6        #4   #5
Busy   yes                           yes       yes  yes

Station  Busy  Op     Vj             Vk             Qj  Qk  Dest  A
Load1    no
Load2    no
Add1     no
Add2     no
Add3     no
Mult1    yes   MUL.D  M[44+Reg[R3]]  Reg[F4]                #3
Mult2    yes   DIV.D                 M[32+Reg[R2]]  #3      #5

Example: ROB, register status, and reservation stations (MUL.D committed)

ROB entry  Busy  Instruction          State         Dest/Addr  Value
#1         no    L.D   F6, 32(R2)     Committed     F6         M[32+Reg[R2]]
#2         no    L.D   F2, 44(R3)     Committed     F2         M[44+Reg[R3]]
#3         no    MUL.D F0, F2, F4     Committed     F0         #2 x Reg[F4]
#4         yes   SUB.D F8, F2, F6     Ready         F8         #2 - #1
#5         yes   DIV.D F10, F0, F6    Execute       F10
#6         yes   ADD.D F6, F8, F2     Ready         F6         #4 + #2

Reg    F0   F1   F2   F3   F4   F5   F6   F7   F8   F10
ROB#                                 #6        #4   #5
Busy                                 yes       yes  yes

Station  Busy  Op     Vj             Vk             Qj  Qk  Dest  A
Load1    no
Load2    no
Add1     no
Add2     no
Add3     no
Mult1    no
Mult2    yes   DIV.D                 M[32+Reg[R2]]  #3      #5

Example: ROB, register status, and reservation stations (SUB.D committed)

ROB entry  Busy  Instruction          State         Dest/Addr  Value
#1         no    L.D   F6, 32(R2)     Committed     F6         M[32+Reg[R2]]
#2         no    L.D   F2, 44(R3)     Committed     F2         M[44+Reg[R3]]
#3         no    MUL.D F0, F2, F4     Committed     F0         #2 x Reg[F4]
#4         no    SUB.D F8, F2, F6     Committed     F8         #2 - #1
#5         yes   DIV.D F10, F0, F6    Execute       F10
#6         yes   ADD.D F6, F8, F2     Ready         F6         #4 + #2

Reg    F0   F1   F2   F3   F4   F5   F6   F7   F8   F10
ROB#                                 #6             #5
Busy                                 yes            yes

Station  Busy  Op     Vj             Vk             Qj  Qk  Dest  A
Load1    no
Load2    no
Add1     no
Add2     no
Add3     no
Mult1    no
Mult2    yes   DIV.D                 M[32+Reg[R2]]  #3      #5

Arithmetic/Logic Operation Processing
1. Issue: when a reservation station and a ROB entry are available
   - Read already available operands from registers and instruction
   - Tag unavailable operands with ROB entry
   - Tag destination register with ROB entry
   - Write destination register to ROB entry
   - Mark ROB entry as busy
2. Execute: after issue
   - Wait for operand values on CDB (if not already available)
   - Compute result
3. Write result: when CDB and ROB are available
   - Send result on CDB to reservation stations
   - Update ROB entry with result, and mark as ready
   - Free reservation station
4. Commit: when at head of ROB and ready
   - Update destination register with result from ROB entry
   - Untag destination register
   - Free ROB entry

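The separation between write result (out of order) and commit (in order, from the head only) can be sketched as follows. The class `ReorderBuffer` and its method names are hypothetical; reservation stations, register tagging, and the CDB are left out, and the sketch retires every ready entry at once, whereas real hardware commits a bounded number per cycle.

```python
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # oldest instruction at the left (the head)
        self.regs = {}           # architectural register file
        self.next_tag = 1

    def issue(self, dest):
        # Allocate an entry in program order; the result is not yet known.
        entry = {"tag": self.next_tag, "dest": dest, "ready": False, "value": None}
        self.next_tag += 1
        self.entries.append(entry)
        return entry["tag"]

    def write_result(self, tag, value):
        # Results may arrive in any order; they only update the ROB entry.
        for e in self.entries:
            if e["tag"] == tag:
                e["value"], e["ready"] = value, True

    def commit(self):
        # Retire from the head only; stop at the first entry that is not ready.
        committed = []
        while self.entries and self.entries[0]["ready"]:
            e = self.entries.popleft()
            self.regs[e["dest"]] = e["value"]
            committed.append(e["tag"])
        return committed

rob = ReorderBuffer()
t1 = rob.issue("F0")             # long-running instruction
t2 = rob.issue("F8")             # younger, finishes first
rob.write_result(t2, 3.0)
first = rob.commit()             # nothing retires: the head is not ready
rob.write_result(t1, 1.5)
second = rob.commit()            # both retire, in program order
print(first, second, rob.regs)
```

The younger instruction's result sits in the ROB until the older one is ready, so the register file only ever reflects a prefix of the program.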
Branch Processing
1. Issue: when a reservation station and a ROB entry are available
   - Read already available operands from registers and instruction
   - Tag unavailable operands with ROB entry
   - Write destination address and outcome prediction to ROB entry
   - Mark ROB entry as busy
2. Execute: after issue
   - Wait for operand values on CDB (if not already available)
   - Compute result (branch condition)
3. Write result: when ROB is available
   - Update ROB entry with result, and mark as ready
   - Free reservation station
4. Commit: when at head of ROB and ready
   - Update branch predictors with result
   - If the result did not agree with the prediction:
     - Flush ROB, reservation stations, and fetch queue
     - Send correct next instruction address to instruction fetch unit
   - Else, free ROB entry

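The commit step for a branch can be sketched as a single decision. The function `commit_branch`, its parameters, and the addresses in the usage below are hypothetical; the ROB is reduced to a list of labels with the branch at the head, and predictor updates are omitted.

```python
def commit_branch(rob, predicted_taken, actual_taken, target, fall_through):
    """rob: entries oldest first; rob[0] is the branch being committed.

    Returns (remaining ROB entries, redirect address or None).
    """
    if predicted_taken == actual_taken:
        # Prediction was correct: free the branch entry, keep younger work.
        return rob[1:], None
    # Misprediction: every younger entry is wrong-path work, so flush it all
    # and tell the fetch unit where the correct path continues.
    next_pc = target if actual_taken else fall_through
    return [], next_pc

rob = ["BNEZ", "LD", "ADDD"]
print(commit_branch(rob, True, True, 0x100, 0x24))    # correct prediction
print(commit_branch(rob, True, False, 0x100, 0x24))   # mispredicted taken
```

Since nothing younger than the branch has committed, discarding the ROB contents is enough to undo all wrong-path effects on registers and memory.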
Load Processing
1. Issue: when a reservation station and a ROB entry are available
   - Read already available operands from registers and instruction
   - Tag unavailable operands with ROB entry
   - Tag destination register with ROB entry
   - Write destination register to ROB entry
   - Mark ROB entry as busy
2. Execute step 1: after issue
   - Wait for base address register value on CDB (if not already available)
   - Compute address
3. Execute step 2
   - Wait if a previous store to the same address (or with a yet unknown address) is in the ROB
   - Read result from memory
4. Write result: when CDB and ROB are available
   - Send result on CDB to reservation stations
   - Update ROB entry with result, and mark as ready
   - Free reservation station
5. Commit: when at head of ROB and ready
   - Update destination register with result from ROB entry
   - Untag destination register
   - Free ROB entry

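The wait condition in execute step 2 can be written as a small check. The function name `load_may_access_memory` and the representation of an unknown store address as `None` are assumptions for this sketch.

```python
def load_may_access_memory(older_stores, load_addr):
    """older_stores: addresses of stores ahead of the load in the ROB.

    None stands for a store whose address has not been computed yet.
    """
    for store_addr in older_stores:
        if store_addr is None or store_addr == load_addr:
            return False   # possible RAW hazard through memory: the load waits
    return True

print(load_may_access_memory([0x40, 0x48], 0x50))   # no conflict
print(load_may_access_memory([0x40, None], 0x50))   # unknown store address
print(load_may_access_memory([0x50], 0x50))         # same address
```

The check is conservative: a store with an unknown address blocks the load even if the addresses later turn out to differ.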
Store Processing
1. Issue: when a reservation station and a ROB entry are available
   - Read already available operands from registers and instruction
   - Tag unavailable operands with ROB entry
   - Mark ROB entry as busy
2. Execute: after issue
   - Wait for operand values on CDB (if not already available)
   - Compute address and store it in ROB entry
3. Write result: when CDB and ROB are available
   - Update ROB entry with source register value, and mark as ready
   - Free reservation station
4. Commit: when at head of ROB and ready
   - Write result (source register value) to memory at computed address
   - Free ROB entry

Exception Handling
- Using a ROB solves the problem of precise exceptions!
  - Mark each instruction in the ROB with information about any exceptions caused by it
  - Do not act on exceptions until commit
- If an exception is detected at commit, treat the instruction (almost) like a mispredicted branch
  - Flush the ROB and fetch queues
  - Start fetching instructions from the exception handler
- Program exception behavior will be preserved!

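Why deferring exceptions to commit makes them precise can be seen in a minimal sketch. The function `commit_until_exception` and the dictionary layout of a ROB entry are hypothetical, and all entries are assumed to have finished execution.

```python
def commit_until_exception(rob, regs):
    """rob: list of dicts, oldest first, with keys 'dest', 'value', 'exception'.

    Commits in program order and stops at the first entry marked with an
    exception. Returns (exception or None, number of committed entries).
    """
    for i, entry in enumerate(rob):
        if entry["exception"] is not None:
            # Precise state: all older instructions have updated the registers,
            # and no younger instruction has. The rest of the ROB is flushed.
            return entry["exception"], i
        regs[entry["dest"]] = entry["value"]
    return None, len(rob)

regs = {}
rob = [
    {"dest": "F0", "value": 1.0, "exception": None},
    {"dest": "F2", "value": None, "exception": "divide by zero"},
    {"dest": "F4", "value": 2.0, "exception": None},   # finished early, never commits
]
result = commit_until_exception(rob, regs)
print(result, regs)
```

The third instruction had its result ready before the exception was acted on, yet F4 is untouched, which is exactly the state a sequential machine would show.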
General Processor Organization: dynamic scheduling with speculative execution
[Figure: processor organization with the blocks Branch predictor, Fetch instruction, Fetch queue, Get operands & Issue, reservation stations (RS) in front of the Memory access, Integer & Logic, and Floating point units, Write result, CDB, ROB, Register file, and Memory]

Multiple Issue
[Figure: processor organization with the blocks Fetch instruction, Get operands & Issue, Memory access, Integer & Logic, Floating point, and Update state]
- Issue several instructions per cycle

Multiple Issue
- Superscalar: issue a variable number of instructions per cycle, depending on hazards
  - Dynamic superscalar: schedule the instructions dynamically
  - Static superscalar: do not schedule dynamically
- Very Long Instruction Word (VLIW): issue a fixed number of instructions per cycle
  - Rely on static scheduling only

VLIW
- Requires wider instructions
- Simpler control logic
- Difficult to find a sufficient number of instructions to issue
- Code becomes hardware dependent

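The difficulty of filling every slot can be illustrated with a greedy static scheduler that packs operations into fixed-width bundles. This is a simplified sketch, not a real VLIW compiler: the function `pack_bundles` is hypothetical, and it assumes unit latency, identical slots, and that every register is written at most once (so only RAW dependencies matter).

```python
def pack_bundles(ops, width):
    """ops: list of (name, dest, sources) in program order.

    Places each operation in the earliest bundle that comes after all of its
    producers and still has a free slot; unused slots are padded with NOPs.
    """
    bundle_of = {}   # destination register -> index of the producing bundle
    bundles = []
    for name, dest, srcs in ops:
        earliest = max([bundle_of[s] + 1 for s in srcs if s in bundle_of], default=0)
        b = earliest
        while b < len(bundles) and len(bundles[b]) == width:
            b += 1                       # bundle full: try the next one
        while len(bundles) <= b:
            bundles.append([])
        bundles[b].append(name)
        bundle_of[dest] = b
    return [b + ["NOP"] * (width - len(b)) for b in bundles]

ops = [
    ("LD F0", "F0", ["R1"]),
    ("LD F6", "F6", ["R1"]),
    ("ADDD F4", "F4", ["F0", "F2"]),
    ("ADDD F8", "F8", ["F6", "F2"]),
    ("MULD F10", "F10", ["F4", "F8"]),
]
schedule = pack_bundles(ops, width=2)
for bundle in schedule:
    print(bundle)
```

The last bundle carries a NOP because only one operation is ready at that point, and the schedule is tied to the assumed width: a machine with a different number of slots needs the code to be repacked.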
Dynamic Superscalar with Speculative Execution
[Figure: processor organization with the blocks Branch predictor, Fetch instruction, Fetch queue, Get operands & Issue, reservation stations (RS) in front of the Memory access, Integer & Logic, and Floating point units, Write result, CDB, ROB, Register file, and Memory]
- Relatively easy to extend up to 2-4 issues per cycle

Advanced Techniques
- Branch-Target Buffers (BTB)
- Return Address Predictors
- Register Renaming

Branch Target Buffer
- Dynamic branch prediction provides the fetch unit with a taken/not-taken prediction
- Branch Target Buffer (BTB)
  - Stores the predicted address of the next instruction for taken branches
  - Functions as a cache memory that is indexed by the addresses of taken branches
- Without a BTB, no instruction can be fetched after a predicted-taken branch until the branch target address has been computed => potentially long stalls
- With a BTB, the stall time for a correctly predicted branch can become zero (if hit in BTB)

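A BTB behaves like a small tagged cache keyed by branch address. The sketch below assumes a direct-mapped organization, 4-byte instructions, and a policy of removing an entry when its branch turns out not taken; the class `BranchTargetBuffer` and these choices are illustrative, not taken from the lecture.

```python
class BranchTargetBuffer:
    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries      # each slot: (branch_pc, target) or None

    def _index(self, pc):
        return (pc >> 2) % self.entries    # assumes 4-byte instructions

    def lookup(self, pc):
        """Return the predicted target on a hit, or None on a miss."""
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:   # full address used as the tag
            return slot[1]
        return None

    def update(self, pc, taken, target):
        i = self._index(pc)
        if taken:
            self.table[i] = (pc, target)         # remember taken branches only
        elif self.table[i] is not None and self.table[i][0] == pc:
            self.table[i] = None                 # branch fell through: drop it

btb = BranchTargetBuffer()
miss = btb.lookup(0x40)            # first encounter: no prediction available
btb.update(0x40, True, 0x10)       # branch at 0x40 was taken to 0x10
hit = btb.lookup(0x40)             # next fetch of 0x40 gets the target at once
alias = btb.lookup(0x80)           # maps to the same slot, but the tag differs
print(miss, hit, alias)
```

On a hit the fetch unit has the next address in the same cycle it fetches the branch, which is how the stall for a correctly predicted taken branch can drop to zero.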
Return Address Predictor
- BTB does not work well for subroutine returns and indirect jumps
  - Calls can be made from many different places
- Return address predictors can help subroutine returns
  - Push return address on a small stack when a call is detected
  - Works perfectly as long as the stack is deep enough

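A return address predictor can be sketched as a bounded stack. The class `ReturnAddressStack`, the default depth, and the choice to discard the oldest entry on overflow are assumptions for illustration.

```python
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_addr):
        if len(self.stack) == self.depth:
            self.stack.pop(0)              # overflow: the oldest entry is lost
        self.stack.append(return_addr)

    def predict_return(self):
        # Predict the most recent call site; None if nothing is recorded.
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack(depth=2)
for addr in (0x104, 0x204, 0x304):         # three nested calls, stack holds two
    ras.on_call(addr)
predictions = [ras.predict_return() for _ in range(3)]
print(predictions)
```

The two innermost returns are predicted correctly; the outermost one has no prediction because its entry was pushed out, which is the "deep enough" condition in practice.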
Register Renaming
- ROB and/or reservation stations in dynamic scheduling provide register renaming
  - Each instruction is provided with a unique location to store its result => WAR and WAW hazards avoided
  - When committed, the result is always written to the register file
- With pure register renaming, used in most high-end modern processors:
  - A larger set of physical registers is available than there are architectural registers
  - Each instruction is assigned a free physical register
  - A mapping is maintained between physical and architectural registers
  - Mappings to a physical register can be marked speculative until the instruction commits, and then become permanent or are removed
  - This avoids the intermediate stages of storing results first in reservation stations, then in the ROB, and then in registers

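Pure register renaming with a map table and a free list can be sketched as below. The function `rename`, the `P<n>` naming of physical registers, and the four-instruction program are hypothetical; speculative marking and returning registers to the free list at commit are omitted.

```python
def rename(program, arch_regs, num_physical):
    """program: list of (op, dest, src1, src2) using architectural names."""
    # Initially each architectural register maps to its own physical register.
    mapping = {r: f"P{i}" for i, r in enumerate(arch_regs)}
    free = [f"P{i}" for i in range(len(arch_regs), num_physical)]
    renamed = []
    for op, dest, s1, s2 in program:
        p1, p2 = mapping[s1], mapping[s2]   # read sources with the current mapping
        new = free.pop(0)                   # fresh physical register for the result
        mapping[dest] = new                 # later readers of dest see the new name
        renamed.append((op, new, p1, p2))
    return renamed

program = [
    ("DIVD", "F0", "F2", "F4"),
    ("ADDD", "F6", "F0", "F8"),
    ("SUBD", "F8", "F10", "F14"),   # WAR with the ADDD on F8
    ("MULD", "F6", "F10", "F8"),    # WAW with the ADDD on F6
]
arch = ["F0", "F2", "F4", "F6", "F8", "F10", "F14"]
renamed = rename(program, arch, num_physical=12)
for instr in renamed:
    print(instr)
```

After renaming, every instruction writes a distinct physical register, so only the true (RAW) dependencies remain: the ADDD still reads the DIVD's result and the MULD still reads the SUBD's.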
Summary: Fetch Bottleneck
- Control hazards
  - Dynamic branch prediction: predict outcome of branches and jumps
  - Branch target buffers
- Memory bottleneck
  - Memory performance improvement (memory hierarchy, prefetch, ...)

Summary: Issue Bottleneck
- RAW hazards
  - Dynamic scheduling (out-of-order execution)
- WAR & WAW hazards
  - Remove name dependencies (register renaming)
- Structural hazards
  - Dynamic scheduling (out-of-order execution)
  - Memory performance improvement (memory hierarchy, prefetch, non-blocking, load/store queues)
  - Multiple and pipelined functional units
- Control hazards
  - Speculative execution
- Single issue
  - Issue multiple instructions per cycle (superscalar, VLIW)

Summary
- Next lecture (4): we will look at what can be done about memory performance (Chapter 5: 5.1-5.2)