Processor Architecture Advanced Dynamic Scheduling Techniques M. Schölzel
Content Tomasulo with speculative execution Introducing superscalarity into the instruction pipeline Multithreading
Control Flow Dependencies Let b be a conditional branch at address a with branch target z. An operation c is control flow dependent on b if the execution of c depends on the outcome of b; otherwise c is not control flow dependent. Examples: if c is located before b, or if c appears on both paths (at the fall-through address a+1 and at the target z), then c is not control flow dependent on b; if c appears on only one of the two paths, it is control flow dependent on b (and, if that path contains a further branch b2, also on b2). What about d?
Scheduling restrictions imposed by control flow dependencies: a control flow dependent operation cannot be moved before the branch; an operation that is not control flow dependent cannot be moved behind the branch. Moving c from behind b to before b means speculative execution of c: in program order, c might not be executed at all.
Performance Problem due to Control Hazards Problem: the branch target of an operation is only known after its execution, so the address for the next instruction fetch is not known while the branch is in flight; this requires long pipeline stalls in processors with deep pipelines. Branch prediction helps, but is limited. Tomasulo supports speculative fetch and issue of operations, but not speculative execution.
Drawbacks of Speculative Execution What happens if an operation is executed speculatively and the speculation was wrong? It may affect the data flow, and it may affect the exception behavior. (In the control flow graph of the program, block c is only to be executed after branch b; in the executed program after dynamic scheduling, c is executed speculatively before b.)
Example Affected data flow: c is executed speculatively before b, so the mul-operation now receives the value in r from the sub- instead of from the add-operation. Affected exception behavior: c (div r <- r4,r5) is executed speculatively before the branch b (if r5 = 0 then x else y); a division by 0 becomes possible, although on the actually executed path x no division takes place.
Solution Split the WB-phase of the Tomasulo algorithm into two phases: forwarding results from the EUs to the RSs (WB-phase), and writing results into architectural registers/memory (commit-phase). Implemented by a reorder buffer that buffers the results from the WB-phase; the buffered results are committed from the reorder buffer in-order. By this, speculative results can be used without modifying architectural registers/memory locations.
Architecture for Tomasulo with Speculative Execution Components: program memory with PC and instruction queue; operation bus; architectural registers Reg 0 ... Reg r; reorder buffer; operand buses A and B; reservation stations with EU buses; memory unit and execution units Execute 1 ... Execute m; result bus.
Reorder Buffer (ROB) Implemented as a queue: when issuing an operation, an entry is reserved; during WB, the result is written back to the reserved entry; commit is done in-order and writes results back to the architectural registers. Speculatively executed operations are committed only after the preceding branches have been committed. ROB entries thus take on the meaning of virtual registers. (Datapath: the result bus feeds entry 1 ... entry n via a DeMux; a Mux forwards committed values to the architectural registers; a bypass leads back to the issue phase; first and last pointers delimit the reserved entries.)
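The reserve/write-back/commit life cycle described above can be sketched as a tiny simulation. This is a minimal illustration under simplified assumptions (unbounded entry IDs, no operation types); all names (ReorderBuffer, reserve, writeback, commit) are invented for the sketch, not taken from the slides' hardware.

```python
from collections import deque

class ReorderBuffer:
    """Bounded in-order queue of results; entry IDs act as virtual registers."""
    def __init__(self, size):
        self.size = size
        self.entries = {}       # id -> {"res": value, "valid": bool}
        self.order = deque()    # issue order; leftmost entry is the oldest
        self.next_id = 1

    def reserve(self):
        """Issue: allocate an entry; return None if the ROB is full (stall)."""
        if len(self.order) == self.size:
            return None
        rid = self.next_id
        self.next_id += 1
        self.entries[rid] = {"res": None, "valid": False}
        self.order.append(rid)
        return rid

    def writeback(self, rid, value):
        """WB: store a (possibly speculative) result; architectural state untouched."""
        self.entries[rid] = {"res": value, "valid": True}

    def commit(self):
        """Commit: retire the oldest entry, but only once its result is valid."""
        if self.order and self.entries[self.order[0]]["valid"]:
            rid = self.order.popleft()
            return rid, self.entries.pop(rid)["res"]
        return None
```

Note how a younger result that is already valid still cannot commit while an older entry is pending: that is exactly the in-order commit property the slide requires.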
Fields of the ROB Structure of a ROB entry: res, addr, type, valid, busy. The meaning of res and addr depends on the operation type:
- Branch operation (type 3): res = computed target address (will be stored in the PC); addr = 'c' if the speculation was correct, 'w' if it was wrong.
- Memory operation (type 2): res = value to be stored in memory; addr = address at which the res-value should be stored.
- ALU operation (type 1): res = result of the operation; addr unused.
valid: 0 = entry reserved, result has not been computed yet; 1 = result was computed and is available in the res-field. busy: the entry is occupied.
Reservation Station Fields An RS has the same functionality as in ordinary Tomasulo: it buffers operations and it buffers operands. But ROB entries are now used for naming the operand source (virtual register). Fields: opc = operation to be executed (e.g. add, sub, mul, ...); Qj = x if ROB entry x will store the value for operand A, otherwise 0; Qk = x if ROB entry x will store the value for operand B, otherwise 0; Vj = value for operand A; Vk = value for operand B; misc = miscellaneous; type = type of operation (see table on the previous slide); rob = reserved ROB entry; stat = status in the pipeline (RO, EX, WB); busy = RS is occupied/free.
Register File Extensions Mapping of architectural registers to virtual registers (ROB entries): architectural register n stores the ID of the ROB entry of the latest operation that is computing the value for n (register renaming); a rob-field of 0 means the register is not being computed by any operation in the pipeline. Example: Reg 0 with rob = 5 means ROB entry 5 contains the result of the latest operation with destination register 0; a register with rob = 0 is not computed by any operation in the pipeline; another ROB entry analogously holds the result of the latest operation with destination register 2.
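The renaming lookup above has three outcomes: value in the register file, value already in the ROB, or a tag to wait on. A minimal sketch, assuming the slides' convention that a rob tag of 0 means "no in-flight producer" (the names regs, rob, rename_dest, read_operand are invented for illustration):

```python
# Register file with a rob tag per architectural register.
regs = {r: {"val": 0, "rob": 0} for r in range(4)}
rob = {}   # rob_id -> {"res": value, "valid": bool}

def rename_dest(r, rob_id):
    """Issue: register r is now produced by ROB entry rob_id."""
    regs[r]["rob"] = rob_id

def read_operand(r):
    """Operand fetch at issue: a value from the register file, a completed
    value forwarded out of the ROB, or a tag to wait on."""
    tag = regs[r]["rob"]
    if tag == 0:
        return ("val", regs[r]["val"])      # architectural value is current
    if rob[tag]["valid"]:
        return ("val", rob[tag]["res"])     # result computed, not yet committed
    return ("wait", tag)                    # producer still executing
```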
Overview Pipeline Phases Issue: schedule an operation from the instruction queue to an RS; read operand values or rename registers (solving WAR- and WAW-hazards); reserve a ROB entry; issue is in-order. Execute: wait for the operands to be ready; execute the operation as soon as the operands are ready and an EU is available (solving RAW-hazards); execute is out-of-order. Write-Back: write the result through the result bus into the reserved ROB entry; WB is out-of-order. Commit: write results from the ROB in-order into the destination registers/memory; commit is in-order.
Overview Pipeline Phases (Issue) Issue an operation from the instruction queue to an RS if an RS is free and the ROB is not full; otherwise stall the issue stage. Allocate the RS and a ROB entry. Read the operands if they are present in the register file or present in the ROB; the ROB entry corresponds to a virtual register.
Overview Pipeline Phases (Execute) The operation waits in its RS for its operands and for a free EU; it is executed as soon as all operands are available and an EU is free. The RS can store the state of the operation during execution.
Overview Pipeline Phases (Write-Back) Write the result into the reserved ROB entry; the ROB entry ID has been stored in the rob-field of the RS. The result is forwarded to all waiting RSs through the result bus (the value is identified by its ROB ID). The RS is freed.
Overview Pipeline Phases (Commit) Write the result from the first entry of the ROB into the corresponding destination register, then free the ROB entry.
Issue-Phase Details (for ALU-operations) For the operation that will be issued let opc denote the operation type (add, sub, mul, ...), src1 and src2 the source registers, and dst the destination register. The operation can be issued if there exists an x with RS[x].busy = 0 and ROB[last].busy = 0. Update after issue:
if Reg[src1].rob = 0 then   // operand value is in the register file
    RS[x].qj := 0; RS[x].vj := Reg[src1]
else   // operand is still under computation or already in the ROB
    if ROB[Reg[src1].rob].valid = 1 then
        RS[x].qj := 0; RS[x].vj := ROB[Reg[src1].rob].res   // read operand from the ROB
    else
        RS[x].qj := Reg[src1].rob   // wait for the operand tag
fi
(the same for src2 with qk/vk)
RS[x].busy := 1; RS[x].rob := tail; RS[x].opc := opc; RS[x].type := 1; RS[x].status := RO; Reg[dst].rob := tail
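The issue step above (structural stall check, operand capture, destination renaming) can be sketched in one function. A simplified sketch: commit/freeing of ROB entries is not modeled, and all names (issue, rs_table, rob, regs) are illustrative, not a real ISA.

```python
def issue(op, rs_table, rob, regs, rob_capacity):
    """Issue op = {"opc", "src1", "src2", "dst"}; return the chosen RS index,
    or None on a structural stall (no free RS, or ROB full)."""
    free = [i for i, rs in enumerate(rs_table) if not rs["busy"]]
    if not free or len(rob) >= rob_capacity:
        return None                               # stall the issue stage
    x, rid = free[0], max(rob, default=0) + 1     # allocate RS and ROB entry
    rob[rid] = {"res": None, "valid": False}
    rs = rs_table[x]
    rs.update(opc=op["opc"], busy=True, rob=rid, stat="RO")
    for q, v, src in (("qj", "vj", "src1"), ("qk", "vk", "src2")):
        tag = regs[op[src]]["rob"]
        if tag == 0:                              # value is architectural
            rs[q], rs[v] = 0, regs[op[src]]["val"]
        elif rob[tag]["valid"]:                   # value already sits in the ROB
            rs[q], rs[v] = 0, rob[tag]["res"]
        else:                                     # remember the producer's tag
            rs[q] = tag
    regs[op["dst"]]["rob"] = rid                  # rename the destination
    return x
```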
Issue-Phase Details (Example 1) Situation: Op A can be issued; the value for r is taken from the register file, the value for r2 is taken from the ROB. Update: if Reg[srcy].rob = 0 then RS[x].qj/k := 0; RS[x].vj/k := Reg[srcy] else if ROB[Reg[srcy].rob].valid = 1 then RS[x].qj/k := 0; RS[x].vj/k := ROB[Reg[srcy].rob].res else RS[x].qj/k := Reg[srcy].rob fi (Program: add r <- r, r2 // Op A; sub r3 <- r, r // Op B. State: R: 5, R: 4, R2: 89, R3: 7; ROB 4: [56,-,,,], 5: [-,-,-,,].)
Issue-Phase Details (Example 1) Situation: Op A was issued and ROB entry 5 was allocated; the allocated RS now holds [add,,,4,56,-,,5,ro,] (fields: opc Qj Qk Vj Vk misc type rob stat busy). (Program: add r <- r, r2 // Op A; sub r3 <- r, r // Op B. State: R: 5, R: 4, R2: 89, R3: 7; ROB 5: [-,-,,,].)
Issue-Phase Details (Example 2) Situation: issue of Op A; the value of r is read from the register file, while the value in r2 is still being computed by the operation holding ROB entry 4 (the ld in the RS dump [ld,,,,-,-,2,4,ro,]). Update: if Reg[srcy].rob = 0 then RS[x].qj/k := 0; RS[x].vj/k := Reg[srcy] else if ROB[Reg[srcy].rob].valid = 1 then RS[x].qj/k := 0; RS[x].vj/k := ROB[Reg[srcy].rob].res else RS[x].qj/k := Reg[srcy].rob fi
Issue-Phase Details (Example 2) Situation: Op A was issued; it uses ROB entry 5 and has to wait for the value of its right operand, which ROB entry 4 will produce (RS dumps: [add,,4,4,-,-,,5,ro,] next to [ld,,,,-,-,2,4,ro,]; fields: opc Qj Qk Vj Vk misc type rob stat busy).
Execute Details An operation can be executed from RS x if RS[x].status = RO and RS[x].qj = 0 and RS[x].qk = 0. Update at the start of execution: perform the computation with RS[x].vj and RS[x].vk; RS[x].status := EX. Update at the end of execution: RS[x].vj := res (store the result temporarily in the reservation station); RS[x].status := WB.
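The readiness test and the two status updates can be sketched directly; a minimal illustration using the slide's convention of parking the result in Vj until WB (ready, execute, and the ALU table are invented names):

```python
def ready(rs):
    """An RS may begin execution once operands are read (stat RO)
    and both source tags are cleared (qj = qk = 0)."""
    return rs["stat"] == "RO" and rs["qj"] == 0 and rs["qk"] == 0

ALU = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b}

def execute(rs):
    rs["stat"] = "EX"                       # start: compute with vj/vk copies
    res = ALU[rs["opc"]](rs["vj"], rs["vk"])
    rs["vj"], rs["stat"] = res, "WB"        # end: park result in vj until WB
    return res
```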
Execute-Phase Details (Example 3) Both operands are ready: RS[x].status = RO, RS[x].qj = 0 and RS[x].qk = 0 (RS dump: [ld,,,,-,-,2,4,ro,]; fields: opc Qj Qk Vj Vk misc type rob stat busy).
Execute-Phase Details (Example 3) The operation is being executed: RS[x].status = EX (RS dump: [ld,,,,-,-,2,4,ex,]).
Execute-Phase Details (Example 3) The result has been computed: it is stored temporarily in the Vj-field and RS[x].status = WB; the result is ready for WB (RS dump: [ld,,,89,-,-,2,4,wb,]).
Write-Back Details (ALU-Operation) Write-back of the result res from RS x is possible if RS[x].status = WB and the result bus is available. Update after WB:
ROB[RS[x].rob].res := RS[x].vj   // write result to the allocated ROB entry
RS[x].busy := 0                  // free the RS
ROB[RS[x].rob].valid := 1        // declare the ROB entry as valid
for all reservation stations y ≠ x:   // forwarding of the result
    if RS[y].qj = RS[x].rob then RS[y].vj := RS[x].vj; RS[y].qj := 0
    if RS[y].qk = RS[x].rob then RS[y].vk := RS[x].vj; RS[y].qk := 0
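The broadcast on the result bus amounts to one ROB fill plus a tag match over all other stations. A sketch (write_back and the dict layouts are invented for illustration); the test below mirrors Example 4's numbers, where the ld's value 89 in ROB entry 4 wakes up the waiting add:

```python
def write_back(x, rs_table, rob):
    """WB for station x: publish (tag, value), fill the ROB entry,
    free the station, and forward to every waiting station."""
    rs = rs_table[x]
    tag, value = rs["rob"], rs["vj"]          # result was parked in vj
    rob[tag] = {"res": value, "valid": True}
    rs["busy"] = False
    for i, y in enumerate(rs_table):
        if i == x or not y["busy"]:
            continue
        if y["qj"] == tag:                    # tag match on the left operand
            y["vj"], y["qj"] = value, 0
        if y["qk"] == tag:                    # tag match on the right operand
            y["vk"], y["qk"] = value, 0
```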
Write-Back-Phase Details (Example 4) Situation: the result of the ld-operation is written back. The result bus carries the pair (ROB entry ID, value), here (4, 89). The add-operation is still waiting for its right-hand operand (Qk = 4).
Write-Back-Phase Details (Example 4) Situation: the result was stored in ROB entry 4; the waiting add-operation has captured the result as well; the RS of the ld-operation was freed.
Commit Details (ALU-Operation) It must be checked that ROB[head].valid = 1 (head denotes the first, i.e. oldest, entry). Update by commit:
for all architectural registers r with Reg[r].rob = head do
    Reg[r] := ROB[head].res; Reg[r].rob := 0
Afterwards the ROB entry is freed.
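The commit check and register update can be sketched as follows; note that a register whose rob-field already points to a younger producer keeps that mapping and is not overwritten (names are illustrative):

```python
def commit(head, rob, regs):
    """Commit the ROB head (ALU op): only if valid; update every register
    still mapped to this entry and clear its mapping."""
    if not rob[head]["valid"]:
        return False                     # oldest result not ready: stall commit
    for r in regs:
        if regs[r]["rob"] == head:       # register still mapped to this entry
            regs[r]["val"] = rob[head]["res"]
            regs[r]["rob"] = 0
    del rob[head]                        # free the ROB entry
    return True
```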
Commit-Phase Details (Example 5) Situation: let head = 4 be the head of the ROB; R2 waits for the result from ROB entry 4 (Reg[R2].rob = 4).
Commit-Phase Details (Example 5) Situation: R2 has received the result from the ROB; its rob-field is cleared and the ROB entry is freed.
Executing Branch Operations Issue: the Vk-field stores the branch target z; the misc-field remembers the address a of the branch operation and also which address (z or a+1) was predicted. Execute: the computed target address is stored in the Vk-field of the RS: Vk := z if the branch is taken, Vk := a+1 if the branch is not taken; the misc-field stores whether or not the prediction was correct ('c' = correct, 'w' = wrong). Write-Back: the res-field of the ROB entry receives the branch target (the Vk-field of the RS); the addr-field receives the value of the misc-field from the RS: 'c' or 'w'. Commit: if the addr-field = 'c', nothing must be done (operations were fetched from the correct address); if the addr-field = 'w', copy the res-field into the PC and flush the whole pipeline: all subsequent ROB entries, all RS entries, and the instruction queue.
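The commit-time recovery step can be sketched with two cases, correct and wrong prediction. A toy model (commit_branch and the state dict are invented; a real flush would squash only entries younger than the branch, here everything speculative is cleared for brevity):

```python
def commit_branch(entry, state):
    """Commit a branch ROB entry: addr 'c' = prediction correct (no action),
    'w' = wrong: reload the PC from res and flush all speculative state."""
    if entry["addr"] == "c":
        return state                 # fetched from the correct address already
    state["pc"] = entry["res"]       # redirect fetch to the real target
    state["rob"].clear()             # squash subsequent (speculative) ROB entries
    state["rs"].clear()              # squash all RS entries
    state["iq"].clear()              # squash the instruction queue
    return state
```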
Branch Details (Example 6 - correct prediction) Program: 1: ld r2 <- (2); 2: bz r2, #23; 3: add r <- r, r // Op A; 4: sub r3 <- r, r // Op B. Situation: the branch operation was issued to RS 2; the branch depends on the ld-operation; Op A and Op B will be executed speculatively.
Branch Details (Example 6 - correct prediction) Situation: Op A is executed speculatively; Op B waits for the result of Op A.
Branch Details (Example 6 - correct prediction) Situation: Op A wrote its result to the ROB, but not to R; Op B is executed speculatively.
Branch Details (Example 6 - correct prediction) Situation: Op B wrote its result to the ROB; the ld-operation writes its result to the ROB; now bz can be executed.
Branch Details (Example 6 - correct prediction) Situation: bz is executed; the branch is not taken; the commit for the ld-operation is done.
Branch Details (Example 6 - correct prediction) Situation: bz was executed and its WB is done; the prediction was correct (ROB[2].addr := c).
Branch Details (Example 6 - correct prediction) Situation: the commit of the branch operation does not require any action, because the prediction was correct. Now the speculatively executed operations A and B can be committed.
Branch Details (Example 7 - wrong prediction) Program as in Example 6. Situation: the same situation as in Example 6, but the ld-operation has stored 0 in R2, i.e. the branch is taken.
Branch Details (Example 7 - wrong prediction) Situation: the bz-operation was executed; the prediction was wrong (ROB[2].addr := w); the correct target can be found in the res-field (23).
Branch Details (Example 7 - wrong prediction) Situation: the commit of the branch moves the correct address into the PC (PC := 23) and the pipeline is flushed.
Executing Memory Operations For the out-of-order execution of memory operations the following holds: the ordering of load-load pairs does not matter; the ordering of load-store pairs as well as of store-store pairs must be maintained. Example: ld r2 <- (r); ld r <- (r4); st r4 -> (r); ld r5 <- (r6); st r7 -> (r8). Strategy: writing to memory takes place during the commit phase (in-order); reading from memory takes place during the execute phase (out-of-order), but only if the valid-field of all preceding store-operations in the ROB is 1.
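The gating rule for loads amounts to scanning the ROB in program order up to the load and checking every older store. A sketch under simplified assumptions (conservative gating on valid bits only, no address disambiguation; names are illustrative):

```python
def load_may_execute(rob_order, rob, load_id):
    """A load may leave its RS for execution only once every older store
    in the ROB has a valid result (address and value known)."""
    for rid in rob_order:                 # rob_order is oldest-first
        if rid == load_id:
            return True                   # no blocking older store found
        if rob[rid]["type"] == "st" and not rob[rid]["valid"]:
            return False                  # an older store is still pending
    return False                          # load_id not in the ROB (defensive)
```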
Example (store-operation) Program: st r3 -> (r); ld r3 <- (r); st r -> (r2). Issue phase: issue the first st-operation. Execute phase: execution of a store can start once both source operands are available; the execution itself has no effect; rather, the WB of the st-operation starts immediately.
Example (store-operation) Updates during WB of the st-operation: ROB[x].res := RS[y].vj; ROB[x].addr := RS[y].vk.
Example (store-operation) Commit for the st-operation: MEM[ROB[first].addr] := ROB[first].res.
Example (load-operation) Suppose the first st-operation was issued and waits for execution; then the ld-operation was issued, and its source operands are available.
Example (load-operation) Situation: the ld-operation is not executed, because the valid-bit of the first st-operation is 0.
Example (load-operation) Situation: now the ld-operation can be executed (see the valid-bit of the first st-operation, which is 1). The ld-operation can read its value either from memory or from the ROB (if the addr-field of a preceding st-operation matches the Vj-field of the ld-operation).
Example (load-operation) WB for the ld-operation is complete; the commit phase for ld-operations is the same as for ALU operations.
Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 0) Loop: ld r <- (r) mul r4 <- r, r2 add r3 <- r3, r4 add r <- r, bne loop, r,2 PC add r3,r3,r4 mul r4,r,r2 ld r,(r) R: R: R2: 3 R3: R4: : 2: 3: 4: 5: 2 3 4 5 6 Memory Unit Execute Execute 2
Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 1) Loop: ld r <- (r) mul r4 <- r, r2 add r3 <- r3, r4 add r <- r, bne loop, r,2 PC add r,r, R: R: R2: 3 : 2: 3: ld r,() add r3,r3,r4 mul r4,r,r2 R3: R4: 4: 5: ld r,() 2 3 4 5 6 Memory Unit Execute Execute 2
Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 2) Loop: ld r <- (r) mul r4 <- r, r2 add r3 <- r3, r4 add r <- r, bne loop, r,2 PC bne loop,r,2 R: R: R2: 3 : 2: 3: ld r,() mul r4 add r,r, add r3,r3,r4 R3: R4: 2 4: 5: ld r,() 2 mul r4,rob,3 3 4 5 6 Memory ld r,() Unit Execute Execute 2
Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 3) Loop: ld r <- (r) mul r4 <- r, r2 add r3 <- r3, r4 add r <- r, bne loop, r,2 PC ld r,(r) R: R: R2: 3 : 2: 3: ld r,() mul r4 add r3 bne loop,r,2 add r,r, R3: 3 R4: 2 4: 5: ld r,() 2 mul r4,rob,3 add r3,,rob2 3 4 5 6 Memory ld r,() Unit Execute Execute 2
Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 4) Loop: ld r <- (r) mul r4 <- r, r2 add r3 <- r3, r4 add r <- r, bne loop, r,2 PC mul r4,r,r2 R: R: R2: 3 4 : 2: 3: 2 mul r4 add r3 ld r,(r) bne loop,r,2 R3: 3 R4: 2 4: 5: add r ld r,() 2 mul r4,2,3 add r3,,rob2 3 4 add r,, 5 6 Memory Unit Execute Execute 2
Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 5) Loop: ld r <- (r) mul r4 <- r, r2 add r3 <- r3, r4 add r <- r, bne loop, r,2 PC 5 add r3,r3,r4 R: 2 R: R2: 3 4 : 2: 3: mul r4 add r3 mul r4,r,r2 ld r,(r) R3: 3 R4: 2 4: 5: add r bne 2 mul r4,2,3 add r3,,rob2 3 4 add r,, 5 bne loop,rob4,2 6 Memory Unit EU-Bus mul Execute r4,2,3 add Execute r,, 2
Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 6) Loop: ld r <- (r) mul r4 <- r, r2 add r3 <- r3, r4 add r <- r, bne loop, r,2 PC 5 Operations are fetched and issued speculatively add r,r, add r3,r3,r4 mul r4,r,r2 R: 2 R: R2: 3 4 R3: 3 R4: 2 : 2: 3: 4: 5: ld r 26 add r3 bne Operations from different loop iterations are in the pipeline ld r,() 2 mul r4,2,3 add r3,,26 3 4 add r,, bne loop,,2 5 6 ld-operation is no longer dependent on the branch operation Memory Unit Execute Execute 2
Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 7) Loop: ld r <- (r) mul r4 <- r, r2 add r3 <- r3, r4 add r <- r, bne loop, r,2 PC 5 bne loop,r,2 R: 2 R: R2: 3 4 : 2: 3: ld r mul r4 add r3 add r,r, add r3,r3,r4 R3: 3 R4: 26 2 4: 5: bne Now speculative execution possible ld r,() 2 mul r4,rob,3 add r3,,26 3 4 bne loop,,2 5 6 Memory ld r,() Unit add Execute r3,,26 bne loop,,2 Execute 2
Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 8) Loop: ld r <- (r) mul r4 <- r, r2 add r3 <- r3, r4 add r <- r, bne loop, r,2 PC 5 bne loop,r,2 add r,r, R: 2 R: R2: 3 R3: 4 3 : 2: 3: 4: ld r mul r4 26 add r3,r3,r4 R4: 26 2 5: bne c ld r,() 2 mul r4,rob,3 add r3,,26 3 4 bne loop,,2 5 6 Memory ld r,() Unit Execute Execute 2
Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 9) Loop: ld r <- (r) mul r4 <- r, r2 add r3 <- r3, r4 add r <- r, bne loop, r,2 PC 5 ld r,(r) bne loop,r,2 R: 2 R: R2: 3 R3: 26 4 3 : 2: 3: 4: 8 mul r4 add r3 add r,r, R4: 26 2 5: bne c ld r,() 2 mul r4,8,3 add r,, 3 4 add r3,26,rob2 5 6 Memory Unit Execute Execute 2
Summary We have seen the Tomasulo algorithm with speculation: the importance of the reorder buffer, and the execution of ALU operations, branch operations, and memory operations. But: the issue and commit phases are limited to processing a single operation per clock cycle.
Content Tomasulo with speculative execution Introducing superscalarity into the instruction pipeline Multithreading
Superscalar Instruction Pipeline So far only the data path is superscalar: operations execute in parallel in the data path, but a CPI < 1 is not possible. Required: superscalar fetch-, issue-, WB- and commit-phases.
Superscalar Fetch-Phase Fetch: fetch n operations simultaneously from the code cache/memory; this requires wider buses (operation buses 1 ... n towards the instruction queue).
Superscalar Issue-Phase Issue: issue the first n operations from the instruction queue (n operation buses required); n operand buses are required for the left operand (A) and n operand buses for the right operand (B). Checking for a free RS and a free ROB entry must be done simultaneously for up to n operations!
Implementing simultaneous checking in the issue-phase For a single operation: the issue logic reads the old ROB status, the old RS status and the register file, and produces the control signals for operand buses A and B as well as the new state for the ROB and the RSs. For two operations: the issue logic of the 2nd operation is combined with that of the 1st operation, so that the 2nd operation sees the renaming performed by the 1st within the same cycle; it drives the control for operand buses A2 and B2, and a combine step merges both results into the new ROB and RS state.
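The crux of the combine step is the intra-group dependency: if the second operation reads the first one's destination, it must capture the first one's freshly allocated ROB tag rather than the stale register-file mapping. A minimal sketch (function and field names are invented):

```python
def dual_issue_tags(op1, op2, next_rob_id):
    """2-wide issue sketch: allocate two consecutive ROB IDs and decide,
    per source of op2, whether it must take op1's fresh tag."""
    rob1, rob2 = next_rob_id, next_rob_id + 1
    tags = {}
    for src in ("src1", "src2"):
        # op2 reading op1's destination must wait on op1's new ROB entry.
        tags[src] = rob1 if op2[src] == op1["dst"] else None
    return rob1, rob2, tags
```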
Superscalar WB-Phase Every EU i has its own result bus E_i, so all EUs may write simultaneously to the ROB. This also makes the bypasses for the reservation stations more complex.
Superscalar Commit-Phase For up to n ROB entries starting at the head: check whether the valid-bit is set to 1; then write their results to the register file. The register file needs n write ports.
Example PowerPC Source: PowerPC e500 Core Family Reference Manual
Limitations for ILP Memory bandwidth limits the number of simultaneously fetched operations (typically 4 to 6 operations). HW overhead and delay arise for: the control logic of the issue-phase, the bypasses for the reservation stations, and the number of read/write ports of the register file. Branches (possible solution: branch prediction). Available parallelism in the application (possible solution: multithreading).
Content Tomasulo with speculative execution Introducing superscalarity into the instruction pipeline Multithreading
Motivation for Multithreading True dependencies prevent the EUs from being used in parallel (horizontal performance loss). Operations with a very long delay during execution create vertical performance loss: e.g. a memory access of an operation A in a Pentium 4 (3-way superscalar) can take 38 clock cycles on a cache miss, i.e. 4 operations would have to bypass Op A in order to utilize the EUs fully during this time; but the reorder buffer has only 26 entries, hence 339 execution cycles are wasted. Solution, multithreading: execute multiple threads that share the same execution units but have no dependencies among each other.
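The slide's concrete figures appear partially garbled in this copy, so the helper below only illustrates the kind of back-of-the-envelope calculation involved: a w-wide core stalled behind a miss of latency L needs roughly w * L independent operations in flight, and the ROB size caps how many of those issue slots can actually be covered. All parameters here are illustrative, not measured.

```python
def ops_needed(width, miss_latency):
    """Independent operations needed to keep a width-wide core busy
    while one operation waits out a miss of miss_latency cycles."""
    return width * miss_latency

def wasted_slots(width, miss_latency, rob_entries):
    """Issue slots that cannot be covered because the ROB caps the
    number of in-flight operations (simplified model)."""
    return max(0, ops_needed(width, miss_latency) - rob_entries)
```

For example, a hypothetical 3-wide core with a 38-cycle miss would need 114 independent operations; with a 26-entry ROB, 88 of those slots go unused in this simplified model.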
Process vs. Thread Each process has its own context: address space (code, data, heap, stack) and TLB. Switching between processes takes tens of thousands of clock cycles (context switch) and involves the OS. Threads share the same context: switching between two threads only requires changing the values in the architectural registers. (Figure: process 1 with one thread; process 2 with threads 1 and 2 sharing code section, data section and heap, but each with its own stack.)
Multithreading A fixed number of n threads can share the same execution units. The hardware supports fast switching between the n threads: n copies of some resources, e.g. the architectural registers including the PC (the figure shows per-thread PC, instruction queue, RF and ROB); fixed partitioning (or limited sharing) of some resources, e.g. the RSs; shared usage of some resources, e.g. the EUs.
Multithreading Types of multithreading: no MT, coarse-grained MT, fine-grained MT, simultaneous MT.
Coarse-Grained Multithreading A single thread runs for many clock cycles before the hardware switches to another thread. The hardware switches between threads only if a long-running operation is detected (e.g. a cache miss) or a fixed time slice has passed. A processor with n-way MT appears to an operating system like n processors; the OS schedules n threads of the same process to these processors.
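The switching policy (run until a long-latency event or an expired time slice) can be illustrated with a toy scheduler trace; everything here (coarse_mt, the global miss-cycle set, the quantum) is a made-up model for illustration, not the hardware mechanism:

```python
def coarse_mt(threads, misses, quantum, total):
    """Toy coarse-grained MT trace: run one thread until the current cycle
    hits a miss (set of global cycle numbers) or its slice expires."""
    trace, t, budget = [], 0, quantum
    for cycle in range(total):
        trace.append(threads[t])          # this cycle belongs to thread t
        budget -= 1
        if cycle in misses or budget == 0:
            t = (t + 1) % len(threads)    # switch to the next thread
            budget = quantum              # fresh time slice
    return trace
```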
Example Two threads are scheduled to the processor Reservation stations and EUs are shared resources Hardware switches between both PCs and IQ (e.g. by multiplexors) Fetched operations are tagged with thread number Situation: Thread is running Instruction Queue thread Programmspeicher OP F. OP C.2 OP E. OP B.2 OP D. OP A.2 Instruction Queue thread 2 PC PC 2 RF RF 2 OP A. OP B. ROB OP C. ROB 2 OP C. OP A. OP B. Memory Unit OP A. Execute OP B. Execute m
Example Memory operation D. of thread was issued Thread is still running Instruction Queue thread Programmspeicher OP G. OP F. OP E. OP C.2 OP B.2 OP A.2 Instruction Queue thread 2 PC PC 2 RF RF 2 OP A. OP B. ROB OP C. OP D. ROB 2 OP C. OP D. OP A. OP B. Memory Unit OP A. Execute OP B. Execute m
Example Memory operation is executed and cache miss is detected Processor has switched to thread 2 another PC is used another instruction queue is used Instruction Queue thread Programmspeicher OP H. OP G. OP F. OP C.2 OP B.2 OP A.2 2 Instruction Queue thread 2 PC PC 2 RF RF 2 OP A. OP B. ROB OP C. OP D. OP E. ROB 2 OP C. OP E. OP D. OP A. OP D. Memory Unit OP A. Execute Execute m
Example Issued operations of thread are further processed But issue now takes place from instruction queue 2 2 Instruction Queue thread Programmspeicher OP H. OP D.2 OP G. OP F. OP C.2 OP B.2 Instruction Queue thread 2 PC PC 2 RF RF 2 OP A. OP B. ROB OP C. OP D. OP E. OP A.2 ROB 2 OP C. OP E. OP D. OP A. OP A.2 OP D. Memory Unit OP A. Execute Execute m
Example
- Issued operations of thread 1 are further processed.
- But issue now takes place from instruction queue 2.
[Figure: OP B.2 has been issued as well; OP A.2 occupies an execute unit while OP D.1 still waits in the memory unit.]
Example
- Operations of thread 1 are further processed, but not committed, while operations from thread 2 are processed simultaneously.
- If operation E.1 has a true dependency on D.1, it blocks a reservation station for operations from thread 2.
- Balancing between shared resources is important.
[Figure: ROB 1 holds OP B.1-OP E.1, ROB 2 holds OP A.2-OP C.2; OP D.1 still occupies the memory unit.]
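The blocking effect can be sketched as follows (station count and op names are made up for illustration): once dependent operations of thread 1 occupy all shared reservation stations, a ready operation of thread 2 cannot issue.

```python
STATIONS = 3  # hypothetical number of shared reservation stations

def try_issue(stations, op, waiting_on=None):
    """Claim a free shared reservation station; False means a stall."""
    if len(stations) >= STATIONS:
        return False
    stations.append((op, waiting_on))
    return True

stations = []
# thread 1: ops after the cache-missing D.1 pile up, waiting on its result
assert try_issue(stations, "E.1", waiting_on="D.1")
assert try_issue(stations, "F.1", waiting_on="E.1")
assert try_issue(stations, "G.1", waiting_on="F.1")
# thread 2 is ready to run, but every shared station is occupied
assert try_issue(stations, "A.2") is False
```

Partitioning the stations per thread (as SMT designs often do) would have reserved a slot for OP A.2, which is exactly the balancing point made above.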
Coarse-Grained MT - Limitations
- Does not help to overcome the problem of horizontal performance loss (a single thread may not have enough ILP).
- Only right after switching between threads are operations of both threads processed simultaneously.
- Switching between threads may have a negative impact on the cache hit rate of each thread, which affects performance negatively.
Fine-Grained Multithreading
- Processor switches to another thread in every clock cycle, e.g. in a round-robin manner.
- This helps to overcome horizontal performance loss.
- A single (shared) instruction queue and a single reorder buffer are sufficient.
- Operations must be tagged with the corresponding thread number.
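The round-robin fetch stage can be sketched as follows (a toy model; op names are illustrative): every cycle the other thread's PC is selected, and each fetched operation is tagged with its thread number before entering the shared queue.

```python
def fetch_fine_grained(threads, cycles):
    """threads: list of op lists. Returns the shared instruction queue."""
    queue = []                     # single shared instruction queue
    pcs = [0] * len(threads)
    for cycle in range(cycles):
        t = cycle % len(threads)   # switch thread every clock cycle
        if pcs[t] < len(threads[t]):
            op = threads[t][pcs[t]]
            queue.append(f"{op}.{t + 1}")   # tag with thread number
            pcs[t] += 1
    return queue

# operations of both threads interleave: A.1, A.2, B.1, B.2, ...
q = fetch_fine_grained([["A", "B", "C"], ["A", "B", "C"]], 6)
```

The tags are what lets the shared downstream structures (IQ, ROB, reservation stations) keep the two architectural states apart, exactly as in the example slides that follow.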
Example
- Thread 1 fetches OP A.1 into the shared instruction queue.
[Figure: program memory, PC 1/PC 2, shared instruction queue (OP A.1), RF 1, RF 2, ROB, memory unit, execute units 1..m]
Example
- Thread 2 fetches OP A.2.
[Figure: shared instruction queue holds OP A.2, OP A.1.]
Example
- Thread 1 fetches OP B.1.
[Figure: shared instruction queue holds OP B.1, OP A.2, OP A.1.]
Example
- Thread 2 fetches OP B.2; OP A.1 is issued and entered into the ROB.
[Figure: shared instruction queue holds OP B.2, OP B.1, OP A.2.]
Example
- Thread 1 fetches OP C.1; OP A.2 is issued.
[Figure: queue holds OP C.1, OP B.2, OP B.1; ROB holds OP A.1, OP A.2; OP A.1 occupies an execute unit.]
Example
- Thread 2 fetches OP C.2; OP B.1 is issued; OP A.1 and OP A.2 are executing.
[Figure: queue holds OP C.2, OP C.1, OP B.2; ROB holds OP A.1, OP A.2, OP B.1.]
Example
- Thread 1 fetches OP D.1; OP B.2 is issued.
[Figure: queue holds OP D.1, OP C.2, OP C.1; ROB holds OP A.1, OP A.2, OP B.1, OP B.2.]
Example
- Thread 2 fetches OP D.2; OP C.1 is issued; OP B.1 and OP B.2 are executing.
[Figure: queue holds OP D.2, OP D.1, OP C.2; ROB holds OP A.1, OP A.2, OP B.1, OP B.2, OP C.1.]
Example
- OP A.1 commits and leaves the ROB.
[Figure: queue still holds OP D.2, OP D.1, OP C.2; ROB holds OP A.2, OP B.1, OP B.2, OP C.1.]
Fine-Grained Multithreading - Limitations
- Vertical performance loss cannot be avoided: a long-running operation prevents other operations of the same thread from being executed; due to the shared IQ and ROB, the other thread is also blocked after a while.
- Improvement: stop fetching for a blocked thread.
- Performance of a single thread is reduced (even if there are no operations from a second, blocked thread), because issue takes place only in every second cycle.
- MT reduces cache performance.
Simultaneous Multithreading
- Mixes Coarse- and Fine-Grained MT: in every clock cycle, operations from n threads are fetched and issued (Intel calls this Hyper-Threading).
- Operations must be tagged with the corresponding thread number.
- Solves the problem of having either horizontal or vertical performance loss:
  - If both threads are not blocked, the available ILP is utilized and horizontal performance loss is avoided.
  - If one thread is blocked, the other thread still uses the resources, and vertical performance loss is avoided (but not the horizontal one).
  - Even if one thread is blocked, the other one can run at full speed (issue in every clock cycle).
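The key SMT property, that a blocked thread does not stall the other one, can be sketched as follows (an illustrative toy model; op names and the blocking condition are made up): every cycle, all unblocked threads issue together.

```python
def issue_smt(threads, is_blocked, cycles):
    """threads: list of op lists; is_blocked(t, cycle) -> bool.
    Returns the list of operations issued in each cycle."""
    pcs = [0] * len(threads)
    trace = []
    for cycle in range(cycles):
        issued = []
        for t, ops in enumerate(threads):
            if is_blocked(t, cycle) or pcs[t] >= len(ops):
                continue                        # this thread yields its slot
            issued.append(f"{ops[pcs[t]]}.{t + 1}")
            pcs[t] += 1
        trace.append(issued)
    return trace

# thread 1 blocks in cycles 2-3 (say, waiting on a long-latency op);
# thread 2 keeps issuing, so no cycle is completely empty
trace = issue_smt([["A", "B", "C"], ["D", "E", "F", "G"]],
                  lambda t, c: t == 0 and c in (2, 3), 4)
```

Cycles 0-1 issue from both threads (no horizontal loss in this model); cycles 2-3 still issue from thread 2 (no vertical loss).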
Example
- Fetch and issue take place simultaneously for both threads.
- Each thread has its own IQ, RF, PC, and ROB.
- Reservation stations are partitioned: some used for thread 1, some for thread 2.
[Figure: program memory, IQ 1 (OP A, OP C, OP E), IQ 2 (OP B, OP D, OP F), PC 1/PC 2, RF 1/RF 2, ROB 1/ROB 2, partitioned reservation stations, memory unit, execute units 1..m]
Example
- Both threads are executed, avoiding horizontal performance loss.
[Figure: IQ 1 holds OP C, OP E, OP G; IQ 2 holds OP D, OP F, OP H; OP A and OP B have been issued.]
Example
- Both threads are executed, avoiding horizontal performance loss.
[Figure: IQ 1 holds OP E, OP G, OP I; IQ 2 holds OP F, OP H, OP J; OP A and OP B are executing, OP C and OP D have been issued.]
Example
- Both threads are executed, avoiding horizontal performance loss.
- But now the long-running operation E is issued.
[Figure: IQ 1 holds OP G, OP I, OP K; IQ 2 holds OP H, OP J, OP L; OP E and OP F occupy reservation stations; OP C and OP D are executing.]
Example
- Assume G is true-dependent on E.
[Figure: reservation stations Res A-Res D; OP E, OP G, OP H wait in the reservation stations; OP E occupies the memory unit, OP F an execute unit.]
Example
- Assume G is true-dependent on E, and I, too.
- Then thread 1 is now blocked.
[Figure: OP G and OP I wait on OP E, which still occupies the memory unit; thread 2's OP H executes.]
Example
- ...but thread 2 can continue.
[Figure: thread 2's operations (OP H, OP J, OP L) keep issuing while thread 1's OP G and OP I wait on the blocked OP E.]
Summary - Multithreading
- Allows the pipeline to be filled with operations from different threads: no data dependencies between operations of different threads, which allows for higher resource utilization.
- Coarse-grained MT suffers from horizontal performance loss.
- Fine-grained MT suffers from vertical performance loss.
- SMT solves these problems. Improvement: balancing between partitioned resources.
- All MT approaches have an impact on the cache performance.
- In particular, fine-grained MT can also be used in statically scheduled processor pipelines to avoid hazards: in a pipeline with n pipeline stages, operations from n threads are issued; no data/control hazards occur because operations in the pipeline have no dependencies.
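The last point can be illustrated with a barrel-style schedule (a sketch with made-up parameters): with n = 4 stages and 4 threads fetched round-robin, all operations in flight at any cycle belong to different threads, so no data or control hazard can arise between them.

```python
N = 4  # pipeline stages == number of interleaved threads

def barrel_trace(cycles, n=N):
    """For each cycle, the thread number occupying each pipeline stage
    (None while the pipeline is still filling)."""
    rows = []
    for cycle in range(cycles):
        rows.append([(cycle - s) % n if cycle - s >= 0 else None
                     for s in range(n)])
    return rows

# within any single cycle, all occupied stages hold distinct threads,
# so no two in-flight operations can depend on each other
for stages in barrel_trace(8):
    occupied = [t for t in stages if t is not None]
    assert len(set(occupied)) == len(occupied)
```

This is why such a design needs no forwarding or hazard-detection logic: by construction, an operation's predecessors in the pipeline come from other threads.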