Processor Architecture

1 Processor Architecture Advanced Dynamic Scheduling Techniques M. Schölzel

2 Content Tomasulo with speculative execution Introducing superscalarity into the instruction pipeline Multithreading

3 Content Tomasulo with speculative execution Introducing superscalarity into the instruction pipeline Multithreading

4 Control Flow Dependencies Let b be a conditional branch at address a with branch target z. An operation c is control flow dependent on b if whether c is executed depends on the outcome of b; otherwise c is not control flow dependent. Examples (figure shows three code fragments): if c appears before b, or at a point that is reached on both outcomes of b, then c is not control flow dependent on b; if c is reachable only on one outcome of b and of a second branch b2, then c is control flow dependent on b and b2. What about d?

5 Scheduling restrictions imposed by control flow dependencies For control flow dependent operations: they cannot be moved before the branch. For operations that are not control flow dependent: they cannot be moved behind a branch. (Figure: program order vs. scheduled order; moving c before b executes c speculatively, while moving c behind b means c may not be executed at all.)

6 Performance Problem due to Control Hazards Problem: The branch target of an operation is only known after execution, so long pipeline stalls are required in processors with deep pipelines. (Figure: instruction queue, memory, PC; while a branch operation is in flight, the address for the next instruction fetch is not known.) Branch prediction helps, but is limited. Tomasulo supports speculative fetch and issue of operations, but not speculative execution.

7 Drawbacks of Speculative Execution What happens if an operation is executed speculatively and the speculation was wrong? It may affect the data flow, and it may affect the exception behavior. (Figure: control flow graph of a program with a block b to be executed and a block c executed speculatively; after dynamic scheduling, c appears before b in the executed program.)

8 Example Affected data flow: c is executed speculatively before b, so the mul-operation now receives the value in r from the sub- instead of from the add-operation. Affected exception behavior: c is executed speculatively before b, so a division by 0 becomes possible. (Figure: left, a: add r <- r2,r3, c: mul r <- r,r6 moved above b: sub r <- r4,r5; right, b: if r5 = 0 then x else y guarding c: div r <- r4,r5; on path x no division is executed, and c does not write to r on the other path.)

9 Solution Split the WB phase of the Tomasulo algorithm into two phases: forwarding results from the EUs to the reservation stations (WB phase), and writing results into architectural registers/memory (commit phase). Implemented by: a reorder buffer that buffers results from the WB phase, and committing the buffered results from the reorder buffer in-order. By this, speculative results can be used without modifying architectural registers/memory locations.
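The WB/commit split can be illustrated with a minimal Python sketch (not from the slides; all names are illustrative): results enter the buffer out of order, but only the oldest finished entry may update the architectural registers.

```python
from collections import deque

class ReorderBuffer:
    """Minimal ROB sketch: write-back is out-of-order, commit is in-order."""
    def __init__(self):
        self.entries = deque()          # oldest entry at the left

    def allocate(self, dest):
        entry = {"dest": dest, "res": None, "valid": False}
        self.entries.append(entry)      # reserved at issue time
        return entry

    def write_back(self, entry, result):
        entry["res"] = result           # may happen in any order
        entry["valid"] = True

    def commit(self, regfile):
        # retire finished entries strictly from the head of the queue
        while self.entries and self.entries[0]["valid"]:
            e = self.entries.popleft()
            regfile[e["dest"]] = e["res"]

rob = ReorderBuffer()
regs = {"r1": 0, "r2": 0}
e1 = rob.allocate("r1")
e2 = rob.allocate("r2")
rob.write_back(e2, 7)     # younger operation finishes first ...
rob.commit(regs)          # ... but must not commit before e1
assert regs == {"r1": 0, "r2": 0}
rob.write_back(e1, 3)
rob.commit(regs)          # now both retire in program order
assert regs == {"r1": 3, "r2": 7}
```

Note how the wrongly speculated case falls out for free: entries that are never written back simply never reach the register file.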

10 Architecture for Tomasulo with speculative Execution (Figure: the program memory with PC feeds the instruction queue; an operation bus and operand buses A and B connect the architectural registers Reg 0 .. Reg r and the reorder buffer to the reservation stations; EU buses connect the reservation stations to the memory unit and the execute units Execute 1 .. Execute m; a result bus returns results.)

11 Reorder Buffer (ROB) Implemented as a queue: when issuing an operation, an entry is reserved; during WB, the result is written back to the reserved entry; commit is done in-order and writes results back to the architectural registers. Speculatively executed operations are committed only after the preceding branches have been committed. ROB entries now have the meaning of virtual registers. (Figure: the result bus feeds a DeMux into entry 1 .. entry n; a Mux leads to the architectural registers; a bypass serves the issue phase; each entry has a busy flag, and first/last pointers mark the reserved region.)

12 Fields of the ROB Structure of a ROB entry: res, addr, type, valid, busy. The meaning of res and addr depends on the operation type stored in the type field (branch, memory, or ALU operation). Branch operation: res = computed target address (will be stored in the PC); addr = c if the speculation was correct, w if it was wrong. Memory operation: res = value to be stored in the memory; addr = address at which the res value should be stored. ALU operation: res = result of the operation; addr unused. For all types: busy = 1 means the entry is reserved and the result has not been computed yet; valid = 1 means the result has been computed and is available in the res field.
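The field table can be captured in a small Python structure (a sketch; the field names follow the slide, while the numeric type codes are assumptions since the digits were lost in the transcription):

```python
from dataclasses import dataclass

# Assumed encoding of the type field: 1 = ALU, 2 = memory, 3 = branch
ALU, MEM, BRANCH = 1, 2, 3

@dataclass
class ROBEntry:
    res: object = None     # result / store value / computed branch target
    addr: object = None    # store address, or 'c'/'w' for branches
    type: int = ALU
    valid: bool = False    # True once the result has been written back
    busy: bool = False     # True while the entry is reserved

e = ROBEntry(type=BRANCH, busy=True)       # reserved at issue
e.res, e.addr, e.valid = 0x23, "c", True   # branch resolved, prediction correct
assert e.valid and e.addr == "c"
```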

13 Reservation Station Fields The reservation station has the same functionality as in ordinary Tomasulo: it buffers operations and operands. But ROB entries are now used for determining the operand source (virtual register). Fields: opc = operation to be executed (e.g. add, sub, mul, ...); Qj = x if ROB entry x will store the value for operand A, otherwise 0; Qk = x if ROB entry x will store the value for operand B, otherwise 0; Vj = value for operand A; Vk = value for operand B; misc = miscellaneous; type = type of operation (see table on the previous slide); rob = reserved ROB entry; stat = status in the pipeline (RO, EX, WB); busy = whether the station is occupied or free. (Figure: operand buses A and B and the operation bus feed a DeMux into station entries [opc, Qj, Qk, Vj, Vk, misc, type, rob, stat, busy]; a Mux drives the EU bus.)

14 Register File Extensions Mapping of architectural registers to virtual registers (ROB entries): architectural register n stores the ROB entry of the latest operation that is computing the value for n (register renaming); a rob field of 0 means the register is not being computed by any operation in the pipeline. (Figure: each register Reg 0 .. Reg r is extended by a rob field; the result bus and operand buses A and B attach to the file. Example: one register has rob = 5, i.e. ROB entry 5 contains the result of the latest operation with that destination register; another register has rob = 0, i.e. it is not computed by any operation in the pipeline; register 2 is mapped to the ROB entry of its latest producer.)

15 Overview Pipeline Phases Issue: schedule an operation from the instruction queue to a reservation station; read operand values or rename registers (solving WAR- and WAW-hazards); reserve a ROB entry; issue is in-order. Execute: wait for the operands to be ready; execute the operation as soon as the operands are ready and an EU is available (solving RAW-hazards); execute is out-of-order. Write-Back: write the result through the result bus into the reserved ROB entry; WB is out-of-order. Commit: write results from the ROB in order into the destination registers/memory; commit is in-order.

16 Overview Pipeline Phases (Issue) Issue an operation from the instruction queue to a reservation station if a station is free and the ROB is not full; otherwise stall the issue stage. Allocate a reservation-station entry and a ROB entry. Read the operands if they are present in the register file or in the ROB; the ROB entry corresponds to a virtual register. (Figure: Op A moves from the program memory/instruction queue into a reservation station; a slot for Op A is reserved in the reorder buffer.)

17 Overview Pipeline Phases (Execute) The operation waits in the reservation station for its operands and a free EU, and is executed as soon as all operands are available and an EU is free; the reservation station can store the state of the operation during execution. (Figure: Op A moves from its reservation station into an execute unit; its reserved ROB slot remains allocated.)

18 Overview Pipeline Phases (Write-Back) Write the result into the reserved ROB entry; the ROB entry ID has been stored in the rob field of the reservation station. The result is forwarded to all waiting reservation stations through the result bus (the value is identified by its ROB ID). The reservation station is freed. (Figure: the result of Op A travels over the result bus into the reserved ROB slot.)

19 Overview Pipeline Phases (Commit) Write the result from the first entry in the ROB into the corresponding destination register, then free the ROB entry. (Figure: the result leaves the head of the reorder buffer and is written into the register file.)

20 Issue-Phase Details (for ALU operations) For the operation that will be issued let opc denote the operation type (add, sub, mul, ...), src1 and src2 the source registers, and dst the destination register. The operation can be issued if there exists an x with RS[x].busy = 0 and ROB[last].busy = 0. Update after issue:
if Reg[src1].rob = 0 then // determine value of left operand
  RS[x].Qj := 0; RS[x].Vj := Reg[src1] // read left operand from the register file
else // left operand is still under computation or in the ROB
  if ROB[Reg[src1].rob].valid = 1 then
    RS[x].Qj := 0; RS[x].Vj := ROB[Reg[src1].rob].res // read operand from the ROB
  else
    RS[x].Qj := Reg[src1].rob // wait for the operand in the reservation station
fi
(the same for the right operand src2 and Qk/Vk)
RS[x].busy := 1; RS[x].rob := last
RS[x].opc := opc; RS[x].type := ALU; RS[x].status := RO
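The operand-read rule above translates directly into Python (a dict-based sketch, not the slides' notation; Q = 0 means "value already available"):

```python
def read_operand(src, reg, rob):
    """Return (Q, V): Q is the producing ROB entry (0 = value available in V)."""
    tag = reg[src]["rob"]
    if tag == 0:                      # no pending writer: read the register file
        return 0, reg[src]["val"]
    if rob[tag]["valid"]:             # writer finished but not yet committed
        return 0, rob[tag]["res"]
    return tag, None                  # still in flight: wait for tag on result bus

reg = {"r1": {"val": 4, "rob": 0}, "r2": {"val": 89, "rob": 4}}
rob = {4: {"res": 56, "valid": True}}
assert read_operand("r1", reg, rob) == (0, 4)    # from the register file
assert read_operand("r2", reg, rob) == (0, 56)   # from ROB entry 4
rob[4]["valid"] = False
assert read_operand("r2", reg, rob) == (4, None) # must wait for entry 4
```

The three return paths correspond exactly to the three branches of the slide's pseudocode.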

21 Issue-Phase Details (Example 1) Situation: Op A can be issued; the value for r is taken from the register file, the value for r2 is taken from the ROB. Program: add r <- r, r2 // Op A; sub r3 <- r, r // Op B. Update: if Reg[srcy].rob = 0 then RS[x].Qj/k := 0; RS[x].Vj/k := Reg[srcy] else if ROB[Reg[srcy].rob].valid = 1 then RS[x].Qj/k := 0; RS[x].Vj/k := ROB[Reg[srcy].rob].res else RS[x].Qj/k := Reg[srcy].rob fi. (Figure: program memory with Op A at the PC; ROB entry 4 already holds a valid result 56, entry 5 is free.)

22 Issue-Phase Details (Example 1) Situation: Op A was issued and ROB entry 5 was allocated. Program: add r <- r, r2 // Op A; sub r3 <- r, r // Op B. Update after issue as on the previous slide. (Figure: the reservation station now holds [add, 0, 0, 4, 56, -, ..., 5, RO, 1], i.e. both operand values are present and ROB entry 5 is reserved; the register file maps the destination register to ROB entry 5.)

23 Issue-Phase Details (Example 2) Situation: issue of Op A; the value of r is read from the register file, while the value in r2 is being computed by the operation that owns ROB entry 4 (a ld-operation). Program: add r <- r, r2 // Op A; sub r3 <- r, r // Op B. Update after issue: if Reg[srcy].rob = 0 then RS[x].Qj/k := 0; RS[x].Vj/k := Reg[srcy] else if ROB[Reg[srcy].rob].valid = 1 then RS[x].Qj/k := 0; RS[x].Vj/k := ROB[Reg[srcy].rob].res else RS[x].Qj/k := Reg[srcy].rob fi. (Figure: ROB entry 4 is reserved but not yet valid; the ld-operation occupies a reservation station in state RO.)

24 Issue-Phase Details (Example 2) Situation: Op A was issued; it uses ROB entry 5 and has to wait for the value of r2 from ROB entry 4. (Figure: the add-operation sits in a reservation station as [add, 0, 4, 4, -, -, ..., 5, RO, 1], where Qk = 4 marks the pending operand; the ld-operation is still in state RO.)

25 Execute Details Executing an operation from a reservation station is possible if RS[x].status = RO and RS[x].Qj = 0 and RS[x].Qk = 0. Update after the start of execution: perform the computation with RS[x].Vj and RS[x].Vk; RS[x].status := EX. Update after the end of execution: RS[x].Vj := res // store the result temporarily in the reservation station; RS[x].status := WB.
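The ready condition is a one-liner (a sketch with RS entries as dicts, mirroring the field names on slide 13):

```python
def can_execute(rs_entry):
    # An operation may start once it has read its operands (state RO) and both
    # source tags are cleared (Qj = Qk = 0 means the values sit in Vj/Vk).
    return rs_entry["stat"] == "RO" and rs_entry["Qj"] == 0 and rs_entry["Qk"] == 0

assert can_execute({"stat": "RO", "Qj": 0, "Qk": 0})
assert not can_execute({"stat": "RO", "Qj": 4, "Qk": 0})   # waiting on ROB entry 4
assert not can_execute({"stat": "EX", "Qj": 0, "Qk": 0})   # already running
```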

26 Execute-Phase Details (Example 3) Both operands are ready: RS[x].status = RO and RS[x].Qj = 0 and RS[x].Qk = 0. (Figure: the ld-operation waits in its reservation station in state RO; program memory, register file, and ROB as before.)

27 Execute-Phase Details (Example 3) The operation is being executed: RS[x].status = EX. (Figure: the ld-entry is now in state EX and occupies the memory unit.)

28 Execute-Phase Details (Example 3) The result is computed: it is stored temporarily in the Vj field, RS[x].status = WB, and the result is ready for write-back. (Figure: the ld-entry now holds the loaded value 89 in Vj and is in state WB.)

29 Write-Back Details (ALU operation) Write-back of the result res from reservation station x is possible if RS[x].status = WB and the result bus is available. Update after WB: ROB[RS[x].rob].res := RS[x].Vj // write the result to the allocated ROB entry; RS[x].busy := 0 // free the reservation station; ROB[RS[x].rob].valid := 1 // declare the ROB entry as valid. For all reservation stations y ≠ x: // forwarding of the result if RS[y].Qj = RS[x].rob then RS[y].Vj := res; RS[y].Qj := 0; if RS[y].Qk = RS[x].rob then RS[y].Vk := res; RS[y].Qk := 0.
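The forwarding loop can be sketched like this (hypothetical dict structures, not the slides' hardware): the result is broadcast together with its ROB tag, and every waiting reservation station whose tag matches captures the value.

```python
def write_back(x, stations, rob):
    tag, res = x["rob"], x["Vj"]          # result was parked in Vj after EX
    rob[tag]["res"] = res                 # write to the allocated ROB entry
    rob[tag]["valid"] = True
    x["busy"] = False                     # free the station
    for y in stations:                    # forward over the result bus
        if y is x or not y["busy"]:
            continue
        if y["Qj"] == tag:
            y["Vj"], y["Qj"] = res, 0
        if y["Qk"] == tag:
            y["Vk"], y["Qk"] = res, 0

rob = {4: {"res": None, "valid": False}}
ld  = {"busy": True, "rob": 4, "Vj": 89, "Qj": 0, "Qk": 0, "Vk": None}
add = {"busy": True, "rob": 5, "Vj": 4, "Qj": 0, "Qk": 4, "Vk": None}
write_back(ld, [ld, add], rob)
assert rob[4] == {"res": 89, "valid": True}
assert add["Qk"] == 0 and add["Vk"] == 89   # operand captured from the bus
```

This matches example 4 on the next slides: the ld's value lands both in its ROB entry and in the add-operation's Vk field in the same step.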

30 Write-Back-Phase Details (Example 4) Situation: the result of the ld-operation is written back. The result bus carries the pair (ROB entry ID, value), e.g. (4, 89). The add-operation waits for its right-hand operand. (Figure: the ld-entry is in state WB, the add-entry waits with Qk = 4; the result bus shows (4, 89).)

31 Write-Back-Phase Details (Example 4) Situation: the result was stored in ROB entry 4; the reservation station containing the add-operation has also captured the result (in its Vk field), and the ld-operation's reservation station was freed. (Figure: ROB entry 4 is now valid; the add-entry holds both operand values and is ready to execute.)

32 Commit Details (ALU operation) It must be checked that ROB[first].valid = 1. Update by commit: for all architectural registers r with Reg[r].rob = first do Reg[r] := ROB[first].res; Reg[r].rob := 0.
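Commit for an ALU operation, sketched in Python (illustrative structures): the result goes to every architectural register whose rename tag still points at the retiring entry, and the mapping is cleared.

```python
def commit(head, rob, reg):
    if not rob[head]["valid"]:
        return False                       # head not finished: stall commit
    for r in reg.values():
        if r["rob"] == head:               # register still renamed to this entry
            r["val"] = rob[head]["res"]
            r["rob"] = 0                   # value is architectural from now on
    rob[head]["busy"] = False              # free the ROB entry
    return True

rob = {4: {"res": 21, "valid": True, "busy": True}}
reg = {"r2": {"val": 89, "rob": 4}, "r3": {"val": 7, "rob": 0}}
assert commit(4, rob, reg)
assert reg["r2"] == {"val": 21, "rob": 0}
assert reg["r3"]["val"] == 7               # unrelated register untouched
```

If a younger operation has since renamed the same register, its rob tag points at a different entry and the stale result is correctly discarded, which is exactly why the condition Reg[r].rob = first is needed.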

33 Commit-Phase Details (Example 5) Situation: let the ROB head be entry 4; R2 waits for the result from ROB entry 4 (Reg[R2].rob = 4). (Figure: ROB entry 4 is valid and can commit; the add-operation still sits in a reservation station.)

34 Commit-Phase Details (Example 5) Situation: R2 has received the result from the ROB; its rob field is cleared and ROB entry 4 is freed. (Figure: register file updated, ROB head advanced.)

35 Executing Branch Operations Issue: the Vk field stores the branch target z; the misc field remembers the address a of the branch operation and also which address (z or a+1) was predicted. Execute: the computed target address is stored in the Vk field of the reservation station: Vk := z if the branch is taken, Vk := a+1 if the branch is not taken; the misc field stores whether or not the prediction was correct (c = correct, w = wrong). Write-Back: the res field of the ROB receives the branch target (the Vk field of the reservation station); the addr field receives the value of the misc field: c or w. Commit: if the addr field is c, nothing must be done (operations were fetched from the correct address). If the addr field is w, then copy the res field into the PC and flush the whole pipeline: all subsequent ROB entries, all reservation-station entries, and the instruction queue.
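The commit step for branches in sketch form (assumed state layout, not the slides' hardware): on a correct prediction nothing happens, on a wrong one the PC is redirected and everything younger is flushed.

```python
def commit_branch(state):
    entry = state["rob"].pop(0)            # head of the ROB is the branch
    if entry["addr"] == "w":               # prediction was wrong
        state["pc"] = entry["res"]         # res holds the computed target
        state["rob"].clear()               # flush all younger ROB entries,
        state["rs"].clear()                # all reservation stations,
        state["iq"].clear()                # and the instruction queue

state = {"pc": 0x30,
         "rob": [{"addr": "w", "res": 0x23}, {"addr": None, "res": 9}],
         "rs": [{"opc": "add"}], "iq": ["sub"]}
commit_branch(state)
assert state["pc"] == 0x23
assert state["rob"] == [] and state["rs"] == [] and state["iq"] == []
```

With addr = "c" the function just retires the entry, which is the "nothing must be done" case of the slide.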

36 Branch Details (Example 6, correct prediction) Situation: the branch operation was issued to reservation station 2; the branch depends on the ld-operation; Op A, Op B, ... will be executed speculatively. Program: 1: ld r2 <- (2); 2: bz r2, #23; 3: add r <- r, r // Op A; 4: sub r3 <- r, r // Op B. (Figure: ROB entries 1 and 2 are reserved for the ld- and bz-operations; the ld executes in the memory unit while the bz waits in state RO.)

37 Branch Details (Example 6, correct prediction) Situation: Op A is executed speculatively; Op B waits for the result of Op A. (Figure: ROB entries 3 and 4 are reserved for Op A and Op B; Op A is in an execute unit, the bz still waits on the ld result.)

38 Branch Details (Example 6, correct prediction) Situation: Op A wrote its result to the ROB, but not to the architectural register; Op B is executed speculatively. (Figure: ROB entry 3 now holds a valid result; the branch is still unresolved.)

39 Branch Details (Example 6, correct prediction) Situation: Op B wrote its result to the ROB; the ld also wrote its result to the ROB, so the bz can now be executed. (Figure: ROB entry 1 holds the loaded value; the bz-entry has its operand and is ready.)

40 Branch Details (Example 6, correct prediction) Situation: the bz is executed and the branch is not taken; the commit for the ld-operation is done. (Figure: ROB entry 1 is freed, R2 has received the loaded value, and the bz is in an execute unit.)

41 Branch Details (Example 6, correct prediction) Situation: the bz was executed and its WB was done; the prediction was correct (ROB[2].addr := c). (Figure: ROB entry 2 holds the target in res and c in addr.)

42 Branch Details (Example 6, correct prediction) Situation: committing the branch operation does not require any action, because the prediction was correct. Now commit can be done for the speculatively executed operations A and B. (Figure: ROB entry 2 is freed; entries 3 and 4 are valid and wait for commit.)

43 Branch Details (Example 7, wrong prediction) Situation: same as in example 6, but the ld-operation has stored 0 in R2, i.e. the branch is taken. (Figure: the bz executes with operand 0; ROB entries 3 and 4 already hold speculative results.)

44 Branch Details (Example 7, wrong prediction) Situation: the bz-operation was executed; the prediction was wrong (ROB[2].addr := w), and the correct target can be found in the res field. (Figure: ROB entry 2 holds [23, w, ...]; speculatively fetched operations are still in flight.)

45 Branch Details (Example 7, wrong prediction) Situation: the commit of the branch moves the correct address into the PC (PC := 23) and the pipeline is flushed: all younger ROB entries, all reservation stations, and the instruction queue are cleared. (Figure: PC = 23; the speculative results in ROB entries 3 and 4 are discarded.)

46 Executing Memory Operations For out-of-order execution of memory operations the following holds: the ordering of load- and load-operations does not matter, but the ordering of load- and store-operations as well as of store- and store-operations must be maintained. Example: ld r2 <- (r); ld r <- (r4); st r4 -> (r); ld r5 <- (r6); st r7 -> (r8). Strategy: writing to memory takes place during the commit phase (in-order); reading from memory takes place during the execute phase (out-of-order), but only if the valid field of all preceding store-operations in the ROB is 1.
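The load rule can be sketched as follows (hypothetical structures): a load may execute only when every older store in the ROB is resolved, and if an older store targets the same address, the load can take its value straight from the ROB (store-to-load forwarding, as slide 52 mentions).

```python
def try_load(load_addr, older_entries, memory):
    """older_entries: ROB entries older than the load, oldest first.
    Returns the loaded value, or None if the load must stall."""
    value = None
    for e in older_entries:
        if e["type"] != "store":
            continue
        if not e["valid"]:
            return None                 # an older store is unresolved: stall
        if e["addr"] == load_addr:
            value = e["res"]            # remember youngest matching store value
    return memory[load_addr] if value is None else value

mem = {0x10: 7}
stores = [{"type": "store", "valid": True, "addr": 0x20, "res": 5}]
assert try_load(0x10, stores, mem) == 7          # no conflict: read memory
stores.append({"type": "store", "valid": False, "addr": None, "res": None})
assert try_load(0x10, stores, mem) is None       # unresolved store: must wait
stores[1].update(valid=True, addr=0x10, res=9)
assert try_load(0x10, stores, mem) == 9          # forwarded from the ROB
```

Stalling on any unresolved store is the conservative policy from the slide; real processors additionally disambiguate addresses speculatively.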

47 Example (store-operation) Issue phase: issue the first st-operation. Program: st r3 -> (r); ld r3 <- (r); st r -> (r2). Execute phase: execution of a store-operation can start as soon as both source operands are available; the execution itself has no effect. Rather, the WB of the st-operation starts immediately. (Figure: the st-entry waits with value 7 and address 5 in state RO; ROB entry 1 is reserved.)

48 Example (store-operation) Updates during WB of the st-operation: ROB[x].res := RS[y].Vj; ROB[x].addr := RS[y].Vk. (Figure: the st-entry is in state WB; its value and address move into the ROB entry.)

49 Example (store-operation) Commit for the st-operation: MEM[ROB[first].addr] := ROB[first].res. (Figure: ROB entry 1 holds [7, 5, ...], i.e. value 7 is written to address 5 at commit.)

50 Example (load-operation) Suppose the first st-operation was issued and waits for execution; then the ld-operation was issued, and its source operands are available. (Figure: the st-entry still waits for its value; the ld-entry is ready in state RO; ROB entries 1 and 2 are reserved.)

51 Example (load-operation) Situation: the ld-operation is not executed, because the valid bit of the preceding st-operation is 0. (Figure: same state as before; the load must wait.)

52 Example (load-operation) Situation: now the ld-operation can be executed (the valid bit of the first st-operation is 1). The ld-operation can read its value either from memory or from the ROB (if the addr field of a preceding st-operation matches the address in the Vj field of the ld-operation). (Figure: ROB entry 1 holds [7, 5, ...] and is valid; the ld-entry executes.)

53 Example (load-operation) The WB for the ld-operation is complete. The commit phase for ld-operations is the same as for ALU operations. (Figure: ROB entry 2 now holds the loaded value.)

54 Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 0) Loop: ld r <- (r); mul r4 <- r, r2; add r3 <- r3, r4; add r <- r,; bne loop, r,2. (Figure: the first loop iteration sits in the instruction queue; ROB and reservation stations are still empty.)

55 Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 1) Loop body as on the previous slide. (Figure: the ld-operation of the first iteration has been issued to a reservation station and ROB entry 1; the following operations wait in the instruction queue.)

56 Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 2) Loop body as before. (Figure: the mul-operation has been issued and waits for the ld result, its operand renamed to a ROB entry; the ld executes in the memory unit.)

57 Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 3) Loop body as before. (Figure: the add r3 operation has been issued and waits for the mul result; the ld is still executing in the memory unit.)

58 Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 4) Loop body as before. (Figure: the add r operation has been issued as well; the reservation stations fill with operations of the first iteration.)

59 Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 5) Loop body as before. (Figure: the bne has been issued and waits for its operand from ROB entry 4; the mul- and add-operations execute on the EUs.)

60 Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 6) Operations are fetched and issued speculatively: operations from different loop iterations are now in the pipeline, and the ld-operation of the second iteration is no longer dependent on the branch operation. (Figure: the second iteration's ld has been issued while the first iteration's bne is still unresolved.)

61 Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 7) Now speculative execution is possible. (Figure: the second iteration's ld executes in the memory unit while the bne of the first iteration is still in flight.)

62 Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 8) (Figure: the bne of the first iteration was resolved; its prediction was correct (addr field c), so the speculatively executed operations may commit.)

63 Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 9) (Figure: execution continues with the second iteration; the ROB head commits the first iteration's results in order.)

64 Summary We have seen the Tomasulo algorithm with speculation, the importance of the reorder buffer, and the execution of ALU-, branch-, and memory-operations. But: the issue- and commit-phases are limited to processing a single operation per clock cycle.

65 Content Tomasulo with speculative execution Introducing superscalarity into the instruction pipeline Multithreading

66 Superscalar Instruction Pipeline So far, only the data path is superscalar: operations execute in parallel in the data path, but CPI < 1 is not possible. Required: superscalar fetch-, issue-, WB-, and commit-phases. (Figure: program memory, PC, register file, reservation stations, ROB, memory unit, and execute units as before.)

67 Superscalar Fetch-Phase Fetch: fetching n operations simultaneously from the code cache/memory requires wider buses. (Figure: cache/memory with n operation buses into the instruction queue; the register file provides operand buses A 1 .. A n and B 1 .. B n.)

68 Superscalar Issue-Phase Issue: issue the first n operations from the instruction queue (n operation buses required); n operand buses for the left operand (A) and n operand buses for the right operand (B) are required. Checking for a free reservation station and a free ROB entry must be done simultaneously for up to n operations! (Figure: as on the previous slide.)

69 Implementing Simultaneous Checking in the Issue-Phase (Figure: for a single operation, the issue logic takes the old ROB state and the old reservation-station state and produces the control for operand buses A and B, the new ROB state, and the new reservation-station state. For two operations, the issue logic of the first operation and the issue logic of the second operation (with RF control for operand buses A2 and B2) are combined to form the new ROB and reservation-station state.)

70 Superscalar WB-Phase Every EU has its own result bus E i, so all EUs may write simultaneously to the ROB. This also makes the bypass for the reservation stations more complex. (Figure: operand buses A 1 .. A n and B 1 .. B n, result buses E 1 .. E m, and bypasses into the reservation stations, memory unit, and execute units.)

71 Superscalar Commit-Phase For up to n ROB entries starting at the head: check whether the valid bit is set to 1, then write their results to the register file. The register file needs n write ports. (Figure: ROB and register file with bypasses; result buses E 1 .. E m.)

72 Example PowerPC Source: PowerPC e5 Core Family Reference Manual

73 Limitations for ILP Memory bandwidth limits the number of simultaneously fetched operations (typically 4 to 6 operations). HW overhead and delay for: the control logic of the issue-phase, the bypasses for the reservation stations, and the number of read-/write-ports in the register file. Branches (possible solution: branch prediction). Available parallelism in the application (possible solution: multithreading).

74 Content Tomasulo with speculative execution Introducing superscalarity into the instruction pipeline Multithreading

75 Motivation for Multithreading True dependencies prevent the EUs from being used in parallel (horizontal performance loss). Operations with a very long delay during execution create vertical performance loss: e.g., a memory access of an operation A in a Pentium 4 (3-way superscalar) can take 38 clock cycles on a cache miss, i.e. 114 operations would have to bypass Op A in order to utilize the EUs fully during this time. But the reorder buffer has only 126 entries; hence, execution cycles are wasted. Solution, multithreading: execute multiple threads that share the same execution units, but have no dependencies on each other. (Figure: without MT the EUs idle until the WB of Op A after 38 cycles; with MT, operations of other threads fill the slots.)

76 Process vs. Thread Each process has its own context: address space (code, data, heap, stack) and TLB. Switching between processes takes tens of thousands of clock cycles (context switch), and the OS is involved. Threads share the same context: switching between two threads only requires changing the values in the architectural registers. (Figure: two processes, each with code section, data section, heap, and stacks; threads 1 and 2 of process 2 share everything except their stacks.)

77 Multithreading A fixed number of n threads can share the same execution units. The hardware supports fast switching between the n threads: n copies of some resources, e.g. the architectural registers (including the PC); fixed partitioning (or limited sharing) of some resources, e.g. the reservation stations; shared usage of some resources, e.g. the EUs. (Figure: two PCs, two instruction queues, two register files RF 1/RF 2, two ROBs; shared reservation stations, memory unit, and execute units.)

78 Multithreading Types of multithreading: no MT, coarse-grained MT, fine-grained MT, simultaneous MT.

79 Coarse Grained Multithreading A single thread runs for many clock cycles before the hardware switches to another thread. The hardware switches between threads only if a long-running operation is detected (e.g. a cache miss) or a fixed time slice has passed. A processor with n-way MT appears to an operating system like n processors; the OS schedules n threads of the same process to these processors.

80 Example Two threads are scheduled to the processor. Reservation stations and EUs are shared resources; the hardware switches between both PCs and instruction queues (e.g. by multiplexors), and fetched operations are tagged with their thread number. Situation: thread 1 is running. (Figure: instruction queues for thread 1 and thread 2, register files RF 1/RF 2, ROB 1/ROB 2; operations OP A.1 to OP C.1 of thread 1 are in flight.)

81 Example Memory operation D.1 of thread 1 was issued; thread 1 is still running. (Figure: OP D.1 occupies the path to the memory unit; thread 2's operations still wait in their instruction queue.)

82 Example The memory operation is executed and a cache miss is detected. The processor has switched to thread 2: another PC and another instruction queue are used. (Figure: fetch now proceeds from PC 2 and instruction queue 2 while OP D.1 waits in the memory unit.)

83 Example The issued operations of thread 1 are further processed, but issue now takes place from instruction queue 2. [diagram: OP A.2 has been issued; OP D.1 still waits at the memory unit]

84 Example The issued operations of thread 1 are further processed, but issue now takes place from instruction queue 2. [diagram: OP B.2 has been issued; OP A.2 is executing while OP D.1 still waits at the memory unit]

85 Example Operations of thread 1 are further processed, but not committed, while operations from thread 2 are processed simultaneously. If operation E.1 has a true dependency on D.1, it blocks its reservation station for operations from thread 2. Balancing between shared resources is important. [diagram: OP C.2 has been issued; OP E.1 of thread 1 still occupies a reservation station while OP D.1 waits at the memory unit]

86 Coarse-Grained MT - Limitations Does not help to overcome the problem of horizontal performance loss (a single thread may not have enough ILP). Only right after switching between threads are operations of both threads processed simultaneously. Switching between threads may have a negative impact on the cache hit rate of each thread, which affects performance negatively.

87 Fine-Grained Multithreading The processor switches to another thread in every clock cycle, e.g. in a round-robin manner. This helps to overcome horizontal performance loss. A single instruction queue and a single reorder buffer are sufficient (shared). Operations must be tagged with the corresponding thread number.
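The round-robin fetch described above can be sketched as follows; it is a toy model of the single shared instruction queue, with illustrative operation names matching the example slides below.

```python
def fine_grained_fetch(threads, cycles):
    """Toy fine-grained MT: fetch one operation per cycle, switching
    threads round-robin; each fetched operation is tagged with its
    thread number before entering the shared instruction queue."""
    queues = [list(t) for t in threads]
    iq = []  # the single shared instruction queue
    for cycle in range(cycles):
        tid = cycle % len(queues)  # another thread every clock cycle
        if queues[tid]:
            iq.append(f"{queues[tid].pop(0)}.{tid + 1}")
    return iq

iq = fine_grained_fetch([["OP A", "OP B"], ["OP A", "OP B"]], cycles=4)
# iq == ["OP A.1", "OP A.2", "OP B.1", "OP B.2"]
```

The tagging step is essential: since both threads share one IQ and one ROB, the thread number is the only way to route results to the correct register file at commit.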

88 Example [diagram: cycle 1 of fine-grained MT; OP A.1 is fetched into the shared instruction queue]

89 Example [diagram: cycle 2; OP A.2 is fetched from thread 2; the shared queue holds OP A.1 and OP A.2]

90 Example [diagram: cycle 3; OP B.1 is fetched from thread 1]

91 Example [diagram: cycle 4; OP B.2 is fetched and OP A.1 is issued]

92 Example [diagram: cycle 5; OP C.1 is fetched, OP A.2 is issued, OP A.1 starts executing]

93 Example [diagram: cycle 6; OP C.2 is fetched, OP B.1 is issued, OP A.1 and OP A.2 execute]

94 Example [diagram: cycle 7; OP D.1 is fetched, OP B.2 is issued]

95 Example [diagram: cycle 8; OP D.2 is fetched, OP C.1 is issued, OP B.1 and OP B.2 execute]

96 Example [diagram: cycle 9; OP A.1 commits and leaves the ROB while OP B.1 and OP B.2 execute]

97 Fine-Grained Multithreading - Limitations Vertical performance loss cannot be avoided: a long-running operation prevents other operations of the same thread from being executed, and due to the shared IQ and ROB the other thread is also blocked after a while. Improvement: stop fetching for a blocked thread. The performance of a single thread is reduced (even if there are no operations from a second, blocked thread), because issue takes place only in every second cycle. MT reduces cache performance.

98 Simultaneous Multithreading Mixes coarse- and fine-grained MT: in every clock cycle, operations from n threads are fetched and issued (Intel calls this Hyper-Threading). Operations must be tagged with the corresponding thread number. This solves the problem of having either horizontal or vertical performance loss: if both threads are not blocked, the available ILP is utilized and horizontal performance loss is avoided; if one thread is blocked, the other thread still uses the resources and vertical performance loss is avoided (but not the horizontal one). Even if one thread is blocked, the other one can run at full speed (issue in every clock cycle).
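The key SMT property stated above, that a blocked thread never stalls the other, can be illustrated with a toy issue model. The one-slot-per-thread-per-cycle policy, the thread queues, and the `blocked` set are assumptions for illustration only.

```python
def smt_issue(threads, blocked=frozenset(), cycles=3):
    """Toy SMT model: each cycle, issue one operation from every
    unblocked thread, so a blocked thread never stalls the other."""
    queues = [list(t) for t in threads]
    issued = []
    for _ in range(cycles):
        slots = [f"{q.pop(0)}.{tid + 1}"
                 for tid, q in enumerate(queues)
                 if tid not in blocked and q]
        issued.append(slots)
    return issued

# both threads unblocked: every issue slot is used in every cycle
both = smt_issue([["A", "C", "E"], ["B", "D", "F"]])
# thread 1 blocked: thread 2 still issues in every cycle at full speed
one = smt_issue([["A", "C", "E"], ["B", "D", "F"]], blocked={0})
```

In the first run no slot is ever empty (no horizontal or vertical loss); in the second run thread 2 keeps issuing every cycle (no vertical loss), but its cycles are only half filled, which is the remaining horizontal loss the slide mentions.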

99 Example Fetch and issue take place simultaneously for both threads. Each thread has its own IQ, RF, PC, and ROB; the reservation stations are partitioned. [diagram: IQ 1 holds OP A, OP C, OP E; IQ 2 holds OP B, OP D, OP F; one group of reservation stations is used for thread 1, the other for thread 2]

100 Example Both threads are executed, which avoids horizontal performance loss. [diagram: OP A and OP B have been issued; both instruction queues advance]

101 Example Both threads are executed, which avoids horizontal performance loss. [diagram: OP C and OP D are issued while OP A and OP B execute]

102 Example Both threads are executed, which avoids horizontal performance loss. But now the long-running operation E is issued. [diagram: OP E and OP F are issued while OP C and OP D execute]

103 Example Assume G is true dependent on E. [diagram: OP E occupies the memory unit; OP G waits in a reservation station for OP E's result while OP F executes]

104 Example Assume G is true dependent on E, and I as well; then thread 1 is now blocked. [diagram: OP G and OP I wait for OP E, filling thread 1's reservation stations; OP H of thread 2 executes]

105 Example but thread 2 can continue. [diagram: thread 1 remains blocked on OP E while thread 2 keeps issuing (OP J, OP L) and executing]

106 Summary - Multithreading Allows filling the pipeline with operations from different threads: there are no data dependencies between operations from different threads, which allows for higher resource utilization. Coarse-grained MT suffers from horizontal performance loss; fine-grained MT suffers from vertical performance loss; SMT solves these problems. Improvement: balancing between partitioned resources. All MT approaches have an impact on cache performance. In particular, fine-grained MT can also be used in statically scheduled processor pipelines to avoid hazards: in a pipeline with n stages, operations from n threads are issued, and no data or control hazards occur because the operations in the pipeline have no dependencies on each other.
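The distinction between horizontal and vertical performance loss used throughout these slides can be expressed as a toy metric over a per-cycle issue log. The 2-wide machine and the example log are illustrative assumptions.

```python
def wasted_slots(issue_log, width=2):
    """Classify unused issue slots: a completely empty cycle counts as
    vertical waste, a partially filled cycle as horizontal waste."""
    vertical = sum(1 for slots in issue_log if not slots)
    horizontal = sum(width - len(slots)
                     for slots in issue_log if 0 < len(slots) < width)
    return vertical, horizontal

log = [["A.1", "B.2"], ["C.1"], [], ["D.2"]]  # 2-wide machine, 4 cycles
v, h = wasted_slots(log)
```

Under this metric, coarse-grained MT leaves horizontal waste untouched, fine-grained MT leaves vertical waste when the shared IQ and ROB fill up, and SMT attacks both by drawing operations from several threads in the same cycle.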


More information

Pipeline issues. Pipeline hazard: RaW. Pipeline hazard: RaW. Calcolatori Elettronici e Sistemi Operativi. Hazards. Data hazard.

Pipeline issues. Pipeline hazard: RaW. Pipeline hazard: RaW. Calcolatori Elettronici e Sistemi Operativi. Hazards. Data hazard. Calcolatori Elettronici e Sistemi Operativi Pipeline issues Hazards Pipeline issues Data hazard Control hazard Structural hazard Pipeline hazard: RaW Pipeline hazard: RaW 5 6 7 8 9 5 6 7 8 9 : add R,R,R

More information

EEC 581 Computer Architecture. Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW)

EEC 581 Computer Architecture. Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW) 1 EEC 581 Computer Architecture Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW) Chansu Yu Electrical and Computer Engineering Cleveland State University Overview

More information

Lecture 11: Out-of-order Processors. Topics: more ooo design details, timing, load-store queue

Lecture 11: Out-of-order Processors. Topics: more ooo design details, timing, load-store queue Lecture 11: Out-of-order Processors Topics: more ooo design details, timing, load-store queue 1 Problem 0 Show the renamed version of the following code: Assume that you have 36 physical registers and

More information

Simultaneous Multithreading Processor

Simultaneous Multithreading Processor Simultaneous Multithreading Processor Paper presented: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor James Lue Some slides are modified from http://hassan.shojania.com/pdf/smt_presentation.pdf

More information

Review: Compiler techniques for parallelism Loop unrolling Ÿ Multiple iterations of loop in software:

Review: Compiler techniques for parallelism Loop unrolling Ÿ Multiple iterations of loop in software: CS152 Computer Architecture and Engineering Lecture 17 Dynamic Scheduling: Tomasulo March 20, 2001 John Kubiatowicz (http.cs.berkeley.edu/~kubitron) lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/

More information

Lecture 16: Core Design. Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue

Lecture 16: Core Design. Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue Lecture 16: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue 1 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction

More information

Tomasulo s Algorithm

Tomasulo s Algorithm Tomasulo s Algorithm Architecture to increase ILP Removes WAR and WAW dependencies during issue WAR and WAW Name Dependencies Artifact of using the same storage location (variable name) Can be avoided

More information

TDT 4260 TDT ILP Chap 2, App. C

TDT 4260 TDT ILP Chap 2, App. C TDT 4260 ILP Chap 2, App. C Intro Ian Bratt (ianbra@idi.ntnu.no) ntnu no) Instruction level parallelism (ILP) A program is sequence of instructions typically written to be executed one after the other

More information

Course on Advanced Computer Architectures

Course on Advanced Computer Architectures Surname (Cognome) Name (Nome) POLIMI ID Number Signature (Firma) SOLUTION Politecnico di Milano, July 9, 2018 Course on Advanced Computer Architectures Prof. D. Sciuto, Prof. C. Silvano EX1 EX2 EX3 Q1

More information

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer

More information

Computer Architectures. Chapter 4. Tien-Fu Chen. National Chung Cheng Univ.

Computer Architectures. Chapter 4. Tien-Fu Chen. National Chung Cheng Univ. Computer Architectures Chapter 4 Tien-Fu Chen National Chung Cheng Univ. chap4-0 Advance Pipelining! Static Scheduling Have compiler to minimize the effect of structural, data, and control dependence "

More information

Topics. Digital Systems Architecture EECE EECE Predication, Prediction, and Speculation

Topics. Digital Systems Architecture EECE EECE Predication, Prediction, and Speculation Digital Systems Architecture EECE 343-01 EECE 292-02 Predication, Prediction, and Speculation Dr. William H. Robinson February 25, 2004 http://eecs.vanderbilt.edu/courses/eece343/ Topics Aha, now I see,

More information

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?) Evolution of Processor Performance So far we examined static & dynamic techniques to improve the performance of single-issue (scalar) pipelined CPU designs including: static & dynamic scheduling, static

More information

CS252 Graduate Computer Architecture Lecture 8. Review: Scoreboard (CDC 6600) Explicit Renaming Precise Interrupts February 13 th, 2010

CS252 Graduate Computer Architecture Lecture 8. Review: Scoreboard (CDC 6600) Explicit Renaming Precise Interrupts February 13 th, 2010 CS252 Graduate Computer Architecture Lecture 8 Explicit Renaming Precise Interrupts February 13 th, 2010 John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley

More information

吳俊興高雄大學資訊工程學系. October Example to eleminate WAR and WAW by register renaming. Tomasulo Algorithm. A Dynamic Algorithm: Tomasulo s Algorithm

吳俊興高雄大學資訊工程學系. October Example to eleminate WAR and WAW by register renaming. Tomasulo Algorithm. A Dynamic Algorithm: Tomasulo s Algorithm EEF011 Computer Architecture 計算機結構 吳俊興高雄大學資訊工程學系 October 2004 Example to eleminate WAR and WAW by register renaming Original DIV.D ADD.D S.D SUB.D MUL.D F0, F2, F4 F6, F0, F8 F6, 0(R1) F8, F10, F14 F6,

More information

Instruction Level Parallelism (ILP)

Instruction Level Parallelism (ILP) 1 / 26 Instruction Level Parallelism (ILP) ILP: The simultaneous execution of multiple instructions from a program. While pipelining is a form of ILP, the general application of ILP goes much further into

More information

Superscalar Architectures: Part 2

Superscalar Architectures: Part 2 Superscalar Architectures: Part 2 Dynamic (Out-of-Order) Scheduling Lecture 3.2 August 23 rd, 2017 Jae W. Lee (jaewlee@snu.ac.kr) Computer Science and Engineering Seoul NaMonal University Download this

More information

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar Complex Pipelining COE 501 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Diversified Pipeline Detecting

More information

Dynamic Scheduling. CSE471 Susan Eggers 1

Dynamic Scheduling. CSE471 Susan Eggers 1 Dynamic Scheduling Why go out of style? expensive hardware for the time (actually, still is, relatively) register files grew so less register pressure early RISCs had lower CPIs Why come back? higher chip

More information

Instruction Level Parallelism

Instruction Level Parallelism Instruction Level Parallelism Software View of Computer Architecture COMP2 Godfrey van der Linden 200-0-0 Introduction Definition of Instruction Level Parallelism(ILP) Pipelining Hazards & Solutions Dynamic

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 4

ECE 571 Advanced Microprocessor-Based Design Lecture 4 ECE 571 Advanced Microprocessor-Based Design Lecture 4 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 January 2016 Homework #1 was due Announcements Homework #2 will be posted

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture 1 L E C T U R E 4: D A T A S T R E A M S I N S T R U C T I O N E X E C U T I O N I N S T R U C T I O N C O M P L E T I O N & R E T I R E M E N T D A T A F L O W & R E G I

More information

Complex Pipelines and Branch Prediction

Complex Pipelines and Branch Prediction Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle

More information

CS 2410 Mid term (fall 2018)

CS 2410 Mid term (fall 2018) CS 2410 Mid term (fall 2018) Name: Question 1 (6+6+3=15 points): Consider two machines, the first being a 5-stage operating at 1ns clock and the second is a 12-stage operating at 0.7ns clock. Due to data

More information

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted

More information