OOO Execution and 21264


1 OOO Execution and the Alpha 21264

3 Parallelism. ET = IC * CPI * CT. IC is more or less fixed, we have shrunk cycle time as far as we can, and we have achieved a CPI of 1. Can we get faster? Yes: we can reduce our CPI to less than 1, but then the processor must do multiple operations at once. This is called Instruction-Level Parallelism (ILP).
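The iron law above can be checked with a quick back-of-the-envelope sketch (the instruction count and cycle time below are made-up values for illustration):

```python
# Execution time from the iron law: ET = IC * CPI * CT.
def execution_time(ic, cpi, ct_ns):
    """Return execution time in seconds for IC instructions,
    CPI cycles per instruction, and a cycle time in nanoseconds."""
    return ic * cpi * ct_ns * 1e-9

ic = 1_000_000_000          # hypothetical: one billion instructions
ct = 1.0                    # hypothetical: 1 ns cycle time (1 GHz)

base = execution_time(ic, cpi=1.0, ct_ns=ct)   # CPI = 1
wide = execution_time(ic, cpi=0.5, ct_ns=ct)   # 2 instructions per cycle

print(base, wide)   # halving CPI halves ET: 1.0 s vs 0.5 s
```

With IC and CT pinned, CPI is the only lever left — which is the whole motivation for the rest of these slides.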

4 The Basic 5-stage Pipeline: Fetch, Decode, EX, Mem, Writeback. Like an assembly line -- instructions move through in lock step. In the best case, it can achieve one instruction per cycle (an IPC of 1). In practice, it's much worse -- branches, data hazards, and long-latency memory operations cause much lower IPC. We want an IPC > 1!!!

5 Approach 1: Widen the pipeline. Run two pipelines side by side (Fetch, Decode, EX, Mem, Writeback), with a fetch unit that reads two instructions per cycle (at PC and PC+4) and a memory that supplies two values. Process two instructions at once instead of 1 -- often one odd-PC instruction and one even-PC instruction, which keeps the instruction fetch logic simpler. This is a 2-wide, in-order, superscalar processor. Potential problems?

6 Dual issue: Structural Hazards. We might not replicate everything -- perhaps only one multiplier, one shifter, and one load/store unit. What if an instruction is in the wrong place? If an upper instruction needs the lower pipeline, squash the lower instruction.

8 Dual issue: Data Hazards. The lower instruction may need a value produced by the upper instruction. Forwarding cannot help us here -- we must stall.

9 Compiling for Dual Issue. The compiler should pair up non-conflicting instructions and align branch targets (by potentially inserting no-ops above them). These are similar to the rules for VLIW, but they are just guidelines, not rules.

10 Beyond Dual Issue. Wider pipelines are possible, and there is often a separate floating point pipeline. Wide issue leads to hardware complexity, and compiling gets harder, too. In practice, processors use one of two options if they want more ILP: if we can change the ISA, VLIW; if we can't, out-of-order execution.

13 Going Out of Order: Data dependence refresher.
1: add $t1,$s2,$s3
2: sub $t2,$s3,$s4
3: or  $t5,$t1,$t2
4: add $t3,$t1,$t2
There is parallelism!! We can execute 1 & 2 at once and 3 & 4 at once. We can parallelize instructions that do not have a read-after-write (RAW) dependence.

15 Data dependences. In general, if there is no dependence between two instructions, we can execute them in either order or simultaneously. But beware: is there a dependence here?
1: add $t1,$s2,$s3
2: sub $t1,$s3,$s4
Can we reorder the instructions? Is the result the same? No! The final value of $t1 is different.

16 False Dependence #1. Also called Write-after-Write (WAW) dependences; they occur when two instructions write to the same register. The dependence is false because no data flows between the instructions -- they just produce an output with the same name.

18 Beware again! Is there a dependence here?
1: add $t1,$s2,$s3
2: sub $s2,$s3,$s4
Can we reorder them to 2 then 1? No! The value in $s2 that instruction 1 needs will be destroyed. The result is not the same.

19 False Dependence #2. This is a Write-after-Read (WAR) dependence. Again, it is false because no data flows between the instructions.

20 Out-of-Order Execution. Any sequence of instructions has a set of RAW, WAW, and WAR hazards that constrain its execution. Can we design a processor that extracts as much parallelism as possible, while still respecting these dependences?
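The three hazard classes can be made concrete with a small sketch (not from the slides) that classifies the dependences between an earlier and a later instruction, written in the (dest, src1, src2) form of the examples above:

```python
def classify(earlier, later):
    """Classify dependences of `later` on `earlier`.
    Each instruction is a (dest, src1, src2) tuple."""
    deps = set()
    d1, s1a, s1b = earlier
    d2, s2a, s2b = later
    if d1 in (s2a, s2b):
        deps.add("RAW")   # later reads what earlier wrote (true dependence)
    if d1 == d2:
        deps.add("WAW")   # both write the same register (false)
    if d2 in (s1a, s1b):
        deps.add("WAR")   # later overwrites a register earlier reads (false)
    return deps

# The three examples from the preceding slides:
print(classify(("$t1","$s2","$s3"), ("$t1","$s3","$s4")))  # {'WAW'}
print(classify(("$t1","$s2","$s3"), ("$s2","$s3","$s4")))  # {'WAR'}
print(classify(("$t1","$s2","$s3"), ("$t5","$t1","$t2")))  # {'RAW'}
```

Only RAW is a true dependence; the other two exist purely because register names are reused.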

21 The Central OOO Idea:
1. Fetch a bunch of instructions
2. Build the dependence graph
3. Find all instructions with no unmet dependences
4. Execute them
5. Repeat

29 Example: apply the idea to a longer sequence.
1: add $t1,$s2,$s3
2: sub $t2,$s3,$s4
3: or  $t3,$t1,$t2
4: add $t5,$t1,$t2
5: or  $t4,$s1,$s3
6: mul $t2,$t3,$s5
7: sl  $t3,$t4,$t2
8: add $t3,$t5,$t1
Draw the RAW, WAW, and WAR edges, then execute each instruction as soon as all of its dependences are met: 8 instructions complete in 5 cycles.
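The hazard rules can be checked mechanically. A minimal sketch (illustrative, not real hardware) schedules each instruction in the first cycle after every instruction it depends on — RAW, WAW, or WAR — and reproduces the 5-cycle result:

```python
# (dest, src1, src2) for instructions 1-8 of the example above.
instrs = [
    ("$t1","$s2","$s3"), ("$t2","$s3","$s4"),
    ("$t3","$t1","$t2"), ("$t5","$t1","$t2"),
    ("$t4","$s1","$s3"), ("$t2","$t3","$s5"),
    ("$t3","$t4","$t2"), ("$t3","$t5","$t1"),
]

def depends(earlier, later):
    """True if `later` has a RAW, WAW, or WAR hazard on `earlier`."""
    d1, *reads1 = earlier
    d2, *reads2 = later
    return d1 in reads2 or d1 == d2 or d2 in reads1

# Cycle of each instruction = 1 + latest cycle among its predecessors.
cycles = []
for i, ins in enumerate(instrs):
    preds = [cycles[j] for j in range(i) if depends(instrs[j], ins)]
    cycles.append(1 + max(preds, default=0))

print(cycles)        # [1, 1, 2, 2, 1, 3, 4, 5]
print(max(cycles))   # 5 -- eight instructions in five cycles
```

Note that instructions 6, 7, and 8 are serialized almost entirely by WAW and WAR edges on $t2 and $t3 — the false dependences that renaming (later in the deck) removes.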

30 Simplified OOO Pipeline: Fetch, Decode, Schedule, EX, Mem, Writeback. A new schedule stage manages the Instruction Window. The window holds the set of instructions the processor examines; fetch and decode fill the window, and the execute stage drains it. Typically, OOO pipelines are also wide, but it is not necessary. Impacts: more forwarding, more stalls, longer branch resolution -- fundamentally more work per instruction.

31 The Instruction Window. The Instruction Window is the set of instructions the processor examines. Fetch and decode fill the window; the execute stage drains it. The larger the window, the more parallelism the processor can find, but... keeping the window filled is a challenge.

32 Case Study: Alpha 21264

33 Digital Equipment Corporation. One of the Big Old Computer companies (along with IBM). Business-oriented computers. Check out Gordon Bell's lecture in the History of Computing class. They produced a string of famous machines. Sold to Compaq in 1998; sold on to HP (and Intel) in 2002.

34 The PDPs. Most famous: the PDP-11, birthplace of UNIX. An elegant ISA, designed by a small team in short order in response to a competitor formed by defecting engineers. 16 bits of virtual address (the PDP-5 and PDP-8 had 12 bits) -- chronically short of address bits. Sold until 1997.

35 The VAX. (In)famous and long-lived. VAX stands for "Virtual Address Extension" (to the PDP-11). LOTS of extensions. Very CISCy -- a polynomial-evaluate instruction, etc.

36 The Alpha. Four processors: 21064, 21164, 21264, 21364 (and the cancelled 21464). "21" for the 21st century; "64" for 64 bits. High-end workstations and servers; the fastest processors in the world at introduction. Ran Unix, VMS (the old VAX OS), WindowsNT, and Linux. Alpha died when Intel bought the IP and the design team.

37 AlphaAXP. A new ISA from scratch: no legacy anything (almost -- there is a VAX-style floating point mode). 64-bit. A very clean RISC ISA: register-register/load-store, no condition codes, conditional moves (reduced branching, but at what cost?), 32 GPRs and 32 FPRs. OS support: PALcode, firmware control of low-level hardware. VAX compatibility provided in software: VAX ISA -> Alpha via a compiler.

38 Alpha 21064. Introduced in 1992 at 150 MHz (blazingly fast at the time). 750nm/0.75 micron process (vs 45nm today). 234mm2 die, 1.6M transistors. 33 Watts. Full custom design.

39 Alpha 21064 (cont). Pipeline: dual issue; 7-stage integer/10-stage FP; 4-cycle misprediction penalty; 45 bypassing paths; 22 instructions in flight. Caches: on-chip L1I + L1D, 8KB each; off-chip L2. Branch prediction: static (forward taken/backward not taken) plus simple dynamic prediction, 80% accuracy.

40 Alpha 21164. Introduced in 1995 at 300 MHz. 500nm/0.5 micron. 299mm2 die, 9.7M transistors. 56W.

41 Alpha 21164 (cont). Pipeline: quad issue (2 integer + 2 FP); 7-stage integer/10-stage FP. Caches: on-chip L1I + L1D, 8KB each, direct-mapped (fast!); hit under miss/miss under miss (21 outstanding at once); on-chip 3-way 96KB L2; off-chip L3 (1-64MB). ISA changes: native support for byte operations. Branch prediction: 5-cycle mispredict penalty; history-based dynamic predictor, bits stored per cache line.

42 Alpha 21264. Introduced in 1998 at 500 MHz; later versions reached 1.2 GHz. 0.35 micron (later shrinks to 0.18). 314mm2 die, 15.2M transistors. 73W.

43 Alpha 21264 (cont). Pipeline: 6-issue (4 integer + 2 FP); 7-stage integer, longer for FP depending on op; 80 in-flight instructions. Caches: on-chip L1I + L1D, 64KB each, 2-way; off-chip L2. Compared to the 21164: 8x the L1 capacity, but no on-chip L2.

44 Aggressive Speculation. The 21264 executes instructions that may or may not be on the correct path. When it's wrong, it has to undo those instructions, so it stores backups of the renaming tables, register file, etc. It must also prevent changes to memory from occurring until the instructions commit.

45 In-Order Fetch and Commit. Fetch is in-order. Execution is out of order, to extract as much parallelism as possible. Commit is in-order: changes become permanent in program order, and this is what is visible to the programmer. This enables precise exceptions (mostly).

46 Alpha 21264: Fetch. The fetch unit pre-decodes instructions in the Icache and uses next line and set predictors. Tournament branch predictor: a local history predictor, a global history predictor, and a third predictor that tracks which of the two is most effective. 2 cycles to make a prediction.
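A minimal sketch of the tournament idea (the component predictors here are made-up stand-ins; the real 21264 uses tables of local and global history counters): a saturating 2-bit chooser is nudged toward whichever component predicted correctly.

```python
class Tournament:
    """Toy tournament predictor: a 2-bit saturating chooser selects
    between two component predictors (modeled as functions PC -> bool)."""
    def __init__(self, pred_a, pred_b):
        self.pred_a, self.pred_b = pred_a, pred_b
        self.choice = 2                  # 0-1 favor A, 2-3 favor B

    def predict(self, pc):
        return self.pred_b(pc) if self.choice >= 2 else self.pred_a(pc)

    def update(self, pc, taken):
        a, b = self.pred_a(pc), self.pred_b(pc)
        if a != b:                       # train only when they disagree
            if b == taken:
                self.choice = min(3, self.choice + 1)
            else:
                self.choice = max(0, self.choice - 1)

# Hypothetical components: A predicts always-taken, B always not-taken.
t = Tournament(lambda pc: True, lambda pc: False)
for _ in range(4):                       # a run of taken branches...
    t.update(0x40, True)
print(t.predict(0x40))                   # True: chooser migrated to A
```

The real chip keeps one such chooser entry per history pattern rather than a single global one, but the update rule is the same shape.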

47 Alpha 21264: ICache/fetch. 64KB, 2-way, 16-byte lines (4 instructions). Each line also contains extra information: the instructions, next line, next way, and pre-decoded bits. It incorporates the BTB and parts of instruction decode. BTB data is protected by 2 bits of hysteresis, trained by the branch predictor. Branch prediction is aggressive, to find parallelism and exploit speculative out-of-order execution -- we want lots of instructions in flight. On a miss, it prefetches up to 64 instructions.

48 Alpha 21264 pipeline stages: Fetch, Slot, Rename, Issue, Reg Read, Execute, Memory. Fetch: the 64KB 2-way L1I (an "enriched" Icache) with next line/set prediction, alongside the branch predictor. Rename: separate integer and FP register rename; Rename through Execute form the out-of-order portion of the machine. Issue: a 20-entry integer IQ and a 15-entry FP IQ. Execute: two integer clusters, each with a full 80-entry copy of the integer register file and two ALUs; the FP side has a 72-entry register file feeding an FP multiply unit and an FP add unit. Memory: a dual-ported 64KB 2-way L1D backed by the L2.

53 How Much Parallelism is There? Not much, in the presence of WAW and WAR dependences. These arise because we must reuse registers, and there are a limited number we can freely reuse. How can we get rid of them?

54 Removing False Dependences. WAW and WAR dependences arise because we have too few registers, so let's add more! But we can't: the architecture only gives us 32 (why, oh why, did we only use 5 bits?). Solution: define a set of internal physical registers as large as the number of instructions that can be in flight. Every instruction in the pipeline gets a register, and a register map table determines which physical register currently holds the value of each architectural register. This is called Register Renaming.

55 Alpha 21264: Renaming. Separate INT and FP renaming. Replaces architectural registers with physical registers: 80 integer physical registers, 72 FP physical registers. Eliminates WAW and WAR hazards. The register map table maintains the mapping between architectural and physical registers -- one copy for each in-flight instruction (80 copies). Special handling for conditional moves.

56 Alpha 21264: Renaming. Two parts: a content-addressable lookup to find the physical registers for the inputs, and register allocation to rename the output. Four instructions can be renamed each cycle: 8 ports on the lookup table, 4 allocations per cycle. There is no fixed location for architectural register values! How can we read architectural register r10?

57 Alpha 21264: Renaming example. Start from the map r1->p1, r2->p2, r3->p3 and rename:
1: Add r3, r2, r3  ->  Add p4, p2, p3   (map after: p1 p2 p4)
2: Sub r2, r1, r3  ->  Sub p5, p1, p4   (map after: p1 p5 p4)
3: Mult r1, r3, r1 ->  Mult p6, p4, p1  (map after: p6 p5 p4)
4: Add r2, r3, r1  ->  Add p7, p4, p6   (map after: p6 p7 p4)
5: Add r2, r1, r3  ->  Add p8, p6, p4   (map after: p6 p8 p4)
One copy of the map table is kept per instruction. After renaming, only the RAW dependences remain; the WAW and WAR hazards are gone.
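The renaming walkthrough above can be reproduced in a few lines (a sketch with made-up helper names; the real hardware does the source lookup content-addressably and recycles registers from a free list):

```python
def rename(instrs, arch_regs=("r1", "r2", "r3")):
    """Rename (op, dest, src1, src2) instructions, allocating a fresh
    physical register for every destination write."""
    map_table = {r: f"p{i+1}" for i, r in enumerate(arch_regs)}
    free = iter(f"p{i}" for i in range(len(arch_regs) + 1, 100))
    out = []
    for op, d, s1, s2 in instrs:
        ps1, ps2 = map_table[s1], map_table[s2]  # look up sources first...
        map_table[d] = next(free)                # ...then allocate the dest
        out.append((op, map_table[d], ps1, ps2))
    return out

code = [("Add","r3","r2","r3"), ("Sub","r2","r1","r3"),
        ("Mult","r1","r3","r1"), ("Add","r2","r3","r1"),
        ("Add","r2","r1","r3")]
for line in rename(code):
    print(line)   # ('Add', 'p4', 'p2', 'p3') ... ('Add', 'p8', 'p6', 'p4')
```

The ordering inside the loop matters: sources must be looked up before the destination is remapped, or an instruction like `Add r3, r2, r3` would wrongly read its own new register.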

64 Alpha 21264: Issue Queue. Separate Int and FP queues. Decouples the front and back ends. Dynamically tracks dependences: an instruction can issue once its input registers have been written; register status is tracked in the register scoreboard. Issues instructions around long-latency operations, exploiting cross-loop parallelism. Issues up to 4 integer and 2 floating point instructions per cycle, oldest first. Compacts the queue (so the free slots are always mostly at the top).

65 Alpha 21264: Issue Queue example. The renamed instructions enter the queue:
1: Add p4, p2, p3
2: Sub p5, p1, p4
3: Mult p6, p4, p1
4: Add p7, p4, p6
5: Add p8, p6, p4
With p1-p3 marked valid in the register scoreboard, only instruction 1 is ready, and it issues alone. Writing p4 wakes instructions 2 and 3, which issue together on the two ALUs; writing p5 and p6 then wakes 4 and 5, which issue together in the third cycle.
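The scoreboard-driven issue loop for the renamed code above can be sketched as follows (2 ALUs, 1-cycle latency, oldest-first -- all simplifications of the real machine):

```python
# Renamed instructions from the example: id -> (dest, src1, src2).
window = {1: ("p4","p2","p3"), 2: ("p5","p1","p4"),
          3: ("p6","p4","p1"), 4: ("p7","p4","p6"),
          5: ("p8","p6","p4")}
ready = {"p1", "p2", "p3"}     # register scoreboard: p1-p3 hold valid values

schedule = []
while window:
    # Oldest-first: pick up to 2 instructions whose sources are both ready.
    issuable = [i for i in sorted(window)
                if {window[i][1], window[i][2]} <= ready][:2]
    schedule.append(issuable)
    for i in issuable:          # 1-cycle latency: results ready next cycle
        ready.add(window.pop(i)[0])

print(schedule)   # [[1], [2, 3], [4, 5]] -- issue takes three cycles
```

Instruction 1 gates everything through p4, so no width greater than 2 would help this particular sequence.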

73 The Issue Window (one entry). Each entry holds the decoded instruction data: the opcode, the source register names (vrs, vrt), and the source values with valid bits. Comparators match the ALU output destination tags (alu_out_dst_0, alu_out_dst_1) against the sources; on a match, the broadcast value (alu_out_value_0/1) is captured and the valid bit is set. When both sources are valid, the entry is Ready.

74 The Issue Window. Ready instructions arbitrate for the ALUs; the arbiter picks which instructions go to ALU0 and ALU1 each cycle.

75 Alpha 21264: Execution. The integer ALUs are clustered: two ALUs share a complete replica of the integer register file, with 1 cycle of extra latency for cross-cluster updates (not a big performance hit). The issue queue can issue any instruction to either cluster, and critical paths tend to stay in one cluster. Area savings: register file area is quadratic in the number of ports. Each replica needs 4 read and 4 write ports (2 local writes, 2 remote), versus 8 read and 4 write ports unclustered -- O(2*8^2) vs O(12^2). It is simpler, too. This is the beginning of the slow wires problem.

76 Alpha 21264: Memory Interface. Memory is king!!! One of Alpha's niche markets was large, memory-intensive applications; they went 64 bits for the physical address space as much as for the virtual. Lots of outstanding requests: 32 loads and 32 stores (D only), plus 8 cache misses (I + D). Big caches (64KB, 2-way) -- what does Patterson's rule of thumb say? 2 loads/stores per cycle, double-pumped instead of multi-ported (area vs clock rate). Virtually-indexed, physically tagged. An 8-entry victim buffer shared between L1I and L1D.

77 Alpha 21264: Memory interface. Memory ordering: renames memory locations. LDQ/STQ: 32 entries each, sorted in fetch order (though entries arrive out of order). Instructions remain in the queues until retirement. Stores watch for younger loads to the same address; if a match occurs, the load and all subsequent instructions are squashed. Stores also watch for younger stores. Speculative loads get speculative data from the speculative store data buffer.

79 Alpha 21264: Retirement. Instructions retire in-order. At retirement, stores write to memory and renamed registers are released: each instruction carries the physical register number that held the previous value of its architectural destination register, and since retirement is in-order, that register is dead by then. On an exception, all younger instructions are squashed and the register map reverts to its state before the exception.

80 Alpha 21264: Memory interface. Ordering violations. Source order: ST r0, 0(r10); LD r1, 0(r11). Execution order: LD r1, 0(r11); ST r0, 0(r10). If r11 == r10 => violation, pipe flush. The load is marked as delayed; in the future it will wait for all previous stores. The delayed flags are cleared every 16,384 cycles.
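The violation check can be sketched as follows (a made-up structure; the real LDQ/STQ are 32-entry CAMs searched in hardware):

```python
def store_executes(store_addr, store_seq, load_queue):
    """When a store computes its address, find younger loads to the same
    address that have already executed -- each is an ordering violation
    that forces a squash and replay.
    load_queue entries are (seq, addr, executed) tuples, seq = fetch order."""
    return [seq for seq, addr, executed in load_queue
            if seq > store_seq and addr == store_addr and executed]

# ST r0, 0(r10) is older (seq 10); LD r1, 0(r11) ran early (seq 11).
loads = [(11, 0x1000, True)]
print(store_executes(0x1000, 10, loads))   # [11] -> squash and replay
print(store_executes(0x2000, 10, loads))   # []   -> different address, OK
```

The "delayed" bit mentioned above is the fix for loads that trip this check repeatedly: once marked, such a load simply waits for all older stores instead of speculating.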

81 Alpha 21264: Memory Interface. Speculative cache hits (integer only). The instruction queue assumes loads hit in the L1. When they don't, do a mini-restart: up to 8 instructions are pulled back into the issue queue to be reissued, resulting in a 2-cycle bubble. A single 4-bit predictor tracks the miss behavior.

82 Alpha 21364. Introduced at 1.2 GHz. 0.18 micron, 130M transistors, 400mm2 die. 1.75MB on-chip L2. Essentially a 21264 with an on-chip L2 cache.

85 Performance progression: 300MHz brought a 1.7x improvement; 600MHz brought a further 1.8x improvement. Overall: a 27.8x improvement -- an 8.3x cycle time improvement and 3.5x from architecture.

88 Modern OOO Processors. The fastest machines in the world are OOO superscalars. AMD Barcelona: 6-wide issue, 106 instructions in flight at once. Intel Nehalem: 5-way issue to 12 ALUs, >128 instructions in flight. OOO provides the most benefit for memory operations: non-dependent instructions can keep executing during cache misses. This is so-called memory-level parallelism, and it is enormously important -- CPU performance is (almost) all about memory performance nowadays (remember the memory wall graphs!).

89 The Problem with OOO. Even the fastest OOO machines only get about 1-2 IPC, even though they are 4-5 wide. Problems: insufficient ILP within applications (per thread, usually); poor branch prediction performance; single threads also have little memory parallelism. Observation: on many cycles, many ALUs and instruction queue slots sit empty.

90 Simultaneous Multithreading. AKA HyperThreading in Intel machines. Run multiple threads at the same time: throw all the instructions into the pipeline, keeping some data separate for each thread -- the renaming table, TLB entries, and PCs -- while the rest of the hardware is shared. It is surprisingly simple (but still quite complicated). Per-thread fetch units (T1-T4) feed the shared Decode, Rename, Schedule, EX, Mem, and Writeback stages.

91 SMT. Advantages: exploit the ILP of multiple threads at once; less dependence on branch prediction (fewer correct predictions required per thread); less idle hardware (increased power efficiency); much higher IPC -- up to 4 (in simulation). Disadvantages: threads can fight over resources and slow each other down. Historical footnote: invented, in part, by our own Dean Tullsen when he was at UW.

92 Keeping the Window Filled. Keeping the instruction window filled is key! Instruction windows are about 32 instructions (size is limited by their complexity, which is considerable). Branches occur every 4-5 instructions, so the processor must predict 6-8 consecutive branches correctly to keep the window full. On a mispredict, you flush the pipeline, which includes emptying the window.
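The arithmetic behind that claim, with illustrative per-branch accuracies:

```python
# With a branch every ~4.5 instructions, a 32-entry window spans
# about 7 branches, and all of them must be predicted correctly.
window, insts_per_branch = 32, 4.5
branches = round(window / insts_per_branch)     # ~7 branches in the window

for accuracy in (0.90, 0.95, 0.99):
    p_full = accuracy ** branches               # all 7 predicted correctly
    print(f"{accuracy:.0%} per-branch -> {p_full:.0%} chance window is full")
```

Even 95% per-branch accuracy keeps the window full only about 70% of the time, which is why aggressive branch prediction (like the 21264's tournament predictor) matters so much to OOO machines.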


More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

Lecture 9: Multiple Issue (Superscalar and VLIW)

Lecture 9: Multiple Issue (Superscalar and VLIW) Lecture 9: Multiple Issue (Superscalar and VLIW) Iakovos Mavroidis Computer Science Department University of Crete Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order

More information

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley Computer Systems Architecture I CSE 560M Lecture 10 Prof. Patrick Crowley Plan for Today Questions Dynamic Execution III discussion Multiple Issue Static multiple issue (+ examples) Dynamic multiple issue

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Getting CPI under 1: Outline

Getting CPI under 1: Outline CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Instruction-Level Parallelism and Its Exploitation

Instruction-Level Parallelism and Its Exploitation Chapter 2 Instruction-Level Parallelism and Its Exploitation 1 Overview Instruction level parallelism Dynamic Scheduling Techniques es Scoreboarding Tomasulo s s Algorithm Reducing Branch Cost with Dynamic

More information

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3) Hardware Speculation and Precise

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

CPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation

CPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Instruction

More information

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović Computer Architecture and Engineering CS52 Quiz #3 March 22nd, 202 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 0 Pages Notes: Not all questions are

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Chapter 4 The Processor 1. Chapter 4D. The Processor

Chapter 4 The Processor 1. Chapter 4D. The Processor Chapter 4 The Processor 1 Chapter 4D The Processor Chapter 4 The Processor 2 Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline

More information

CSE 240A Midterm Exam

CSE 240A Midterm Exam Student ID Page 1 of 7 2011 Fall Professor Steven Swanson CSE 240A Midterm Exam Please write your name at the top of each page This is a close book, closed notes exam. No outside material may be used.

More information

CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming

CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming CS 152 Computer Architecture and Engineering Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming John Wawrzynek Electrical Engineering and Computer Sciences University of California at

More information

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Tomasulo

More information

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown

More information

Course on Advanced Computer Architectures

Course on Advanced Computer Architectures Surname (Cognome) Name (Nome) POLIMI ID Number Signature (Firma) SOLUTION Politecnico di Milano, July 9, 2018 Course on Advanced Computer Architectures Prof. D. Sciuto, Prof. C. Silvano EX1 EX2 EX3 Q1

More information

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

Lecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ

Lecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ 1 An Out-of-Order Processor Implementation Reorder Buffer (ROB)

More information

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

EITF20: Computer Architecture Part3.2.1: Pipeline - 3 EITF20: Computer Architecture Part3.2.1: Pipeline - 3 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Dynamic scheduling - Tomasulo Superscalar, VLIW Speculation ILP limitations What we have done

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

E0-243: Computer Architecture

E0-243: Computer Architecture E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation

More information

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士 Computer Architecture 计算机体系结构 Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2018 Review Hazards (data/name/control) RAW, WAR, WAW hazards Different types

More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage

More information

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections ) Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections 2.3-2.6) 1 Correlating Predictors Basic branch prediction: maintain a 2-bit saturating counter for each

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation 1 Branch Prediction Basic 2-bit predictor: For each branch: Predict taken or not

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Four Steps of Speculative Tomasulo cycle 0

Four Steps of Speculative Tomasulo cycle 0 HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly

More information

HANDLING MEMORY OPS. Dynamically Scheduling Memory Ops. Loads and Stores. Loads and Stores. Loads and Stores. Memory Forwarding

HANDLING MEMORY OPS. Dynamically Scheduling Memory Ops. Loads and Stores. Loads and Stores. Loads and Stores. Memory Forwarding HANDLING MEMORY OPS 9 Dynamically Scheduling Memory Ops Compilers must schedule memory ops conservatively Options for hardware: Hold loads until all prior stores execute (conservative) Execute loads as

More information

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 15: Instruction Level Parallelism and Dynamic Execution March 11, 2002 Prof. David E. Culler Computer Science 252 Spring 2002

More information

EECS 452 Lecture 9 TLP Thread-Level Parallelism

EECS 452 Lecture 9 TLP Thread-Level Parallelism EECS 452 Lecture 9 TLP Thread-Level Parallelism Instructor: Gokhan Memik EECS Dept., Northwestern University The lecture is adapted from slides by Iris Bahar (Brown), James Hoe (CMU), and John Shen (CMU

More information

Itanium 2 Processor Microarchitecture Overview

Itanium 2 Processor Microarchitecture Overview Itanium 2 Processor Microarchitecture Overview Don Soltis, Mark Gibson Cameron McNairy, August 2002 Block Diagram F 16KB L1 I-cache Instr 2 Instr 1 Instr 0 M/A M/A M/A M/A I/A Template I/A B B 2 FMACs

More information

Multi-cycle Instructions in the Pipeline (Floating Point)

Multi-cycle Instructions in the Pipeline (Floating Point) Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining

More information

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers) physical register file that is the same size as the architectural registers

More information

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest

More information

Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.

Module 5: MIPS R10000: A Case Study Lecture 9: MIPS R10000: A Case Study MIPS R A case study in modern microarchitecture. Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R10000 A case study in modern microarchitecture Overview Stage 1: Fetch Stage 2: Decode/Rename Branch prediction Branch

More information

Multithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others

Multithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others Multithreading Processors and Static Optimization Review Adapted from Bhuyan, Patterson, Eggers, probably others Schedule of things to do By Wednesday the 9 th at 9pm Please send a milestone report (as

More information

Lecture 18: Instruction Level Parallelism -- Dynamic Superscalar, Advanced Techniques,

Lecture 18: Instruction Level Parallelism -- Dynamic Superscalar, Advanced Techniques, Lecture 18: Instruction Level Parallelism -- Dynamic Superscalar, Advanced Techniques, ARM Cortex-A53, and Intel Core i7 CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

More information

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls

More information

A Key Theme of CIS 371: Parallelism. CIS 371 Computer Organization and Design. Readings. This Unit: (In-Order) Superscalar Pipelines

A Key Theme of CIS 371: Parallelism. CIS 371 Computer Organization and Design. Readings. This Unit: (In-Order) Superscalar Pipelines A Key Theme of CIS 371: arallelism CIS 371 Computer Organization and Design Unit 10: Superscalar ipelines reviously: pipeline-level parallelism Work on execute of one instruction in parallel with decode

More information

Dynamic Issue & HW Speculation. Raising the IPC Ceiling

Dynamic Issue & HW Speculation. Raising the IPC Ceiling Dynamic Issue & HW Speculation Today s topics: Superscalar pipelines Dynamic Issue Scoreboarding: control centric approach Tomasulo: data centric approach 1 CS6810 Raising the IPC Ceiling w/ single-issue

More information

Page 1. Raising the IPC Ceiling. Dynamic Issue & HW Speculation. Fix OOO Completion Problem First. Reorder Buffer In Action

Page 1. Raising the IPC Ceiling. Dynamic Issue & HW Speculation. Fix OOO Completion Problem First. Reorder Buffer In Action Raising the IPC Ceiling Dynamic Issue & HW Speculation Today s topics: Superscalar pipelines Dynamic Issue Scoreboarding: control centric approach Tomasulo: data centric approach w/ single-issue IPC max

More information

Lecture 9: Dynamic ILP. Topics: out-of-order processors (Sections )

Lecture 9: Dynamic ILP. Topics: out-of-order processors (Sections ) Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections 2.3-2.6) 1 An Out-of-Order Processor Implementation Reorder Buffer (ROB) Branch prediction and instr fetch R1 R1+R2 R2 R1+R3 BEQZ R2 R3

More information

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Instruction-Level Parallelism and its Exploitation: PART 1 ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Project and Case

More information

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) 18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures

More information

Super Scalar. Kalyan Basu March 21,

Super Scalar. Kalyan Basu March 21, Super Scalar Kalyan Basu basu@cse.uta.edu March 21, 2007 1 Super scalar Pipelines A pipeline that can complete more than 1 instruction per cycle is called a super scalar pipeline. We know how to build

More information

TDT 4260 lecture 7 spring semester 2015

TDT 4260 lecture 7 spring semester 2015 1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding

More information

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011 5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr

More information

Static & Dynamic Instruction Scheduling

Static & Dynamic Instruction Scheduling CS3014: Concurrent Systems Static & Dynamic Instruction Scheduling Slides originally developed by Drew Hilton, Amir Roth, Milo Martin and Joe Devietti at University of Pennsylvania 1 Instruction Scheduling

More information

Instruction Level Parallelism

Instruction Level Parallelism Instruction Level Parallelism The potential overlap among instruction execution is called Instruction Level Parallelism (ILP) since instructions can be executed in parallel. There are mainly two approaches

More information

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model. Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

More information

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis6627 Powerpoint Lecture Notes from John Hennessy

More information

LECTURE 3: THE PROCESSOR

LECTURE 3: THE PROCESSOR LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU

More information

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors William Stallings Computer Organization and Architecture 8 th Edition Chapter 14 Instruction Level Parallelism and Superscalar Processors What is Superscalar? Common instructions (arithmetic, load/store,

More information

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU 1-6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU Product Overview Introduction 1. ARCHITECTURE OVERVIEW The Cyrix 6x86 CPU is a leader in the sixth generation of high

More information