OOO Execution and 21264


1 OOO Execution and the Alpha 21264

3 Parallelism. ET = IC * CPI * CT. IC is more or less fixed, we have shrunk cycle time as far as we can, and we have achieved a CPI of 1. Can we get faster? Yes: we can reduce our CPI to less than 1, but then the processor must do multiple operations at once. This is called Instruction-Level Parallelism (ILP).
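The iron law above can be checked with a quick back-of-the-envelope sketch (the instruction count and cycle time below are made-up values for illustration):

```python
# Execution time from the iron law: ET = IC * CPI * CT.
def execution_time(ic, cpi, ct_ns):
    """Return execution time in seconds for IC instructions,
    CPI cycles per instruction, and a cycle time in nanoseconds."""
    return ic * cpi * ct_ns * 1e-9

ic = 1_000_000_000          # hypothetical: one billion instructions
ct = 1.0                    # hypothetical: 1 ns cycle time (1 GHz)

base = execution_time(ic, cpi=1.0, ct_ns=ct)   # CPI = 1
wide = execution_time(ic, cpi=0.5, ct_ns=ct)   # 2 instructions per cycle

print(base, wide)   # halving CPI halves ET: 1.0 s vs 0.5 s
```

With IC and CT pinned, CPI is the only lever left — which is the whole motivation for the rest of these slides.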

4 The Basic 5-stage Pipeline: Fetch, Decode, EX, Mem, Writeback. Like an assembly line -- instructions move through in lock step. In the best case, it can achieve one instruction per cycle (an IPC of 1). In practice, it's much worse -- branches, data hazards, and long-latency memory operations cause much lower IPC. We want an IPC > 1!!!

5 Approach 1: Widen the pipeline. Run two pipelines side by side (Fetch, Decode, EX, Mem, Writeback), with a fetch unit that reads two instructions per cycle (at PC and PC+4) and a memory that supplies two values. Process two instructions at once instead of 1 -- often one odd-PC instruction and one even-PC instruction, which keeps the instruction fetch logic simpler. This is a 2-wide, in-order, superscalar processor. Potential problems?

6 Dual issue: Structural Hazards. We might not replicate everything -- perhaps only one multiplier, one shifter, and one load/store unit. What if an instruction is in the wrong place? If an upper instruction needs the lower pipeline, squash the lower instruction.

8 Dual issue: Data Hazards. The lower instruction may need a value produced by the upper instruction. Forwarding cannot help us here -- we must stall.

9 Compiling for Dual Issue. The compiler should pair up non-conflicting instructions and align branch targets (by potentially inserting no-ops above them). These are similar to the rules for VLIW, but they are just guidelines, not rules.

10 Beyond Dual Issue. Wider pipelines are possible, and there is often a separate floating point pipeline. Wide issue leads to hardware complexity, and compiling gets harder, too. In practice, processors use one of two options if they want more ILP: if we can change the ISA, VLIW; if we can't, out-of-order execution.

13 Going Out of Order: Data dependence refresher.
1: add $t1,$s2,$s3
2: sub $t2,$s3,$s4
3: or  $t5,$t1,$t2
4: add $t3,$t1,$t2
There is parallelism!! We can execute 1 & 2 at once and 3 & 4 at once. We can parallelize instructions that do not have a read-after-write (RAW) dependence.

15 Data dependences. In general, if there is no dependence between two instructions, we can execute them in either order or simultaneously. But beware: is there a dependence here?
1: add $t1,$s2,$s3
2: sub $t1,$s3,$s4
Can we reorder the instructions? Is the result the same? No! The final value of $t1 is different.

16 False Dependence #1. Also called Write-after-Write (WAW) dependences; they occur when two instructions write to the same register. The dependence is false because no data flows between the instructions -- they just produce an output with the same name.

18 Beware again! Is there a dependence here?
1: add $t1,$s2,$s3
2: sub $s2,$s3,$s4
Can we reorder them to 2 then 1? No! The value in $s2 that instruction 1 needs will be destroyed. The result is not the same.

19 False Dependence #2. This is a Write-after-Read (WAR) dependence. Again, it is false because no data flows between the instructions.

20 Out-of-Order Execution. Any sequence of instructions has a set of RAW, WAW, and WAR hazards that constrain its execution. Can we design a processor that extracts as much parallelism as possible, while still respecting these dependences?
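The three hazard classes can be made concrete with a small sketch (not from the slides) that classifies the dependences between an earlier and a later instruction, written in the (dest, src1, src2) form of the examples above:

```python
def classify(earlier, later):
    """Classify dependences of `later` on `earlier`.
    Each instruction is a (dest, src1, src2) tuple."""
    deps = set()
    d1, s1a, s1b = earlier
    d2, s2a, s2b = later
    if d1 in (s2a, s2b):
        deps.add("RAW")   # later reads what earlier wrote (true dependence)
    if d1 == d2:
        deps.add("WAW")   # both write the same register (false)
    if d2 in (s1a, s1b):
        deps.add("WAR")   # later overwrites a register earlier reads (false)
    return deps

# The three examples from the preceding slides:
print(classify(("$t1","$s2","$s3"), ("$t1","$s3","$s4")))  # {'WAW'}
print(classify(("$t1","$s2","$s3"), ("$s2","$s3","$s4")))  # {'WAR'}
print(classify(("$t1","$s2","$s3"), ("$t5","$t1","$t2")))  # {'RAW'}
```

Only RAW is a true dependence; the other two exist purely because register names are reused.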

21 The Central OOO Idea:
1. Fetch a bunch of instructions
2. Build the dependence graph
3. Find all instructions with no unmet dependences
4. Execute them
5. Repeat

29 Example: apply the idea to a longer sequence.
1: add $t1,$s2,$s3
2: sub $t2,$s3,$s4
3: or  $t3,$t1,$t2
4: add $t5,$t1,$t2
5: or  $t4,$s1,$s3
6: mul $t2,$t3,$s5
7: sl  $t3,$t4,$t2
8: add $t3,$t5,$t1
Draw the RAW, WAW, and WAR edges, then execute each instruction as soon as all of its dependences are met: 8 instructions complete in 5 cycles.
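The hazard rules can be checked mechanically. A minimal sketch (illustrative, not real hardware) schedules each instruction in the first cycle after every instruction it depends on — RAW, WAW, or WAR — and reproduces the 5-cycle result:

```python
# (dest, src1, src2) for instructions 1-8 of the example above.
instrs = [
    ("$t1","$s2","$s3"), ("$t2","$s3","$s4"),
    ("$t3","$t1","$t2"), ("$t5","$t1","$t2"),
    ("$t4","$s1","$s3"), ("$t2","$t3","$s5"),
    ("$t3","$t4","$t2"), ("$t3","$t5","$t1"),
]

def depends(earlier, later):
    """True if `later` has a RAW, WAW, or WAR hazard on `earlier`."""
    d1, *reads1 = earlier
    d2, *reads2 = later
    return d1 in reads2 or d1 == d2 or d2 in reads1

# Cycle of each instruction = 1 + latest cycle among its predecessors.
cycles = []
for i, ins in enumerate(instrs):
    preds = [cycles[j] for j in range(i) if depends(instrs[j], ins)]
    cycles.append(1 + max(preds, default=0))

print(cycles)        # [1, 1, 2, 2, 1, 3, 4, 5]
print(max(cycles))   # 5 -- eight instructions in five cycles
```

Note that instructions 6, 7, and 8 are serialized almost entirely by WAW and WAR edges on $t2 and $t3 — the false dependences that renaming (later in the deck) removes.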

30 Simplified OOO Pipeline: Fetch, Decode, Schedule, EX, Mem, Writeback. A new schedule stage manages the Instruction Window. The window holds the set of instructions the processor examines; fetch and decode fill the window, and the execute stage drains it. Typically, OOO pipelines are also wide, but it is not necessary. Impacts: more forwarding, more stalls, longer branch resolution -- fundamentally more work per instruction.

31 The Instruction Window. The Instruction Window is the set of instructions the processor examines. Fetch and decode fill the window; the execute stage drains it. The larger the window, the more parallelism the processor can find, but... keeping the window filled is a challenge.

32 Case Study: Alpha 21264

33 Digital Equipment Corporation. One of the Big Old Computer companies (along with IBM). Business-oriented computers. Check out Gordon Bell's lecture in the History of Computing class. They produced a string of famous machines. Sold to Compaq in 1998; sold on to HP (and Intel) in 2002.

34 The PDPs. Most famous: the PDP-11, birthplace of UNIX. An elegant ISA, designed by a small team in short order in response to a competitor formed by defecting engineers. 16 bits of virtual address (the PDP-5 and PDP-8 had 12 bits) -- chronically short of address bits. Sold until 1997.

35 The VAX. (In)famous and long-lived. VAX stands for "Virtual Address Extension" (to the PDP-11). LOTS of extensions. Very CISCy -- a polynomial-evaluate instruction, etc.

36 The Alpha. Four processors: 21064, 21164, 21264, 21364 (and the cancelled 21464). "21" for the 21st century; "64" for 64 bits. High-end workstations and servers; the fastest processors in the world at introduction. Ran Unix, VMS (the old VAX OS), WindowsNT, and Linux. Alpha died when Intel bought the IP and the design team.

37 AlphaAXP. A new ISA from scratch: no legacy anything (almost -- there is a VAX-style floating point mode). 64-bit. A very clean RISC ISA: register-register/load-store, no condition codes, conditional moves (reduced branching, but at what cost?), 32 GPRs and 32 FPRs. OS support: PALcode, firmware control of low-level hardware. VAX compatibility provided in software: VAX ISA -> Alpha via a compiler.

38 Alpha 21064. Introduced in 1992 at 150 MHz (blazingly fast at the time). 750nm/0.75 micron process (vs 45nm today). 234mm2 die, 1.6M transistors. 33 Watts. Full custom design.

39 Alpha 21064 (cont). Pipeline: dual issue; 7-stage integer/10-stage FP; 4-cycle misprediction penalty; 45 bypassing paths; 22 instructions in flight. Caches: on-chip L1I + L1D, 8KB each; off-chip L2. Branch prediction: static (forward taken/backward not taken) plus simple dynamic prediction, 80% accuracy.

40 Alpha 21164. Introduced in 1995 at 300 MHz. 500nm/0.5 micron. 299mm2 die, 9.7M transistors. 56W.

41 Alpha 21164 (cont). Pipeline: quad issue (2 integer + 2 FP); 7-stage integer/10-stage FP. Caches: on-chip L1I + L1D, 8KB each, direct-mapped (fast!); hit under miss/miss under miss (21 outstanding at once); on-chip 3-way 96KB L2; off-chip L3 (1-64MB). ISA changes: native support for byte operations. Branch prediction: 5-cycle mispredict penalty; history-based dynamic predictor, bits stored per cache line.

42 Alpha 21264. Introduced in 1998 at 500 MHz; later versions reached 1.2 GHz. 0.35 micron (later shrinks to 0.18). 314mm2 die, 15.2M transistors. 73W.

43 Alpha 21264 (cont). Pipeline: 6-issue (4 integer + 2 FP); 7-stage integer, longer for FP depending on op; 80 in-flight instructions. Caches: on-chip L1I + L1D, 64KB each, 2-way; off-chip L2. Compared to the 21164: 8x the L1 capacity, but no on-chip L2.

44 Aggressive Speculation. The 21264 executes instructions that may or may not be on the correct path. When it's wrong, it has to undo those instructions, so it stores backups of the renaming tables, register file, etc. It must also prevent changes to memory from occurring until the instructions commit.

45 In-Order Fetch and Commit. Fetch is in-order. Execution is out of order, to extract as much parallelism as possible. Commit is in-order: changes become permanent in program order, and this is what is visible to the programmer. This enables precise exceptions (mostly).

46 Alpha 21264: Fetch. The fetch unit pre-decodes instructions in the Icache and uses next line and set predictors. Tournament branch predictor: a local history predictor, a global history predictor, and a third predictor that tracks which of the two is most effective. 2 cycles to make a prediction.
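A minimal sketch of the tournament idea (the component predictors here are made-up stand-ins; the real 21264 uses tables of local and global history counters): a saturating 2-bit chooser is nudged toward whichever component predicted correctly.

```python
class Tournament:
    """Toy tournament predictor: a 2-bit saturating chooser selects
    between two component predictors (modeled as functions PC -> bool)."""
    def __init__(self, pred_a, pred_b):
        self.pred_a, self.pred_b = pred_a, pred_b
        self.choice = 2                  # 0-1 favor A, 2-3 favor B

    def predict(self, pc):
        return self.pred_b(pc) if self.choice >= 2 else self.pred_a(pc)

    def update(self, pc, taken):
        a, b = self.pred_a(pc), self.pred_b(pc)
        if a != b:                       # train only when they disagree
            if b == taken:
                self.choice = min(3, self.choice + 1)
            else:
                self.choice = max(0, self.choice - 1)

# Hypothetical components: A predicts always-taken, B always not-taken.
t = Tournament(lambda pc: True, lambda pc: False)
for _ in range(4):                       # a run of taken branches...
    t.update(0x40, True)
print(t.predict(0x40))                   # True: chooser migrated to A
```

The real chip keeps one such chooser entry per history pattern rather than a single global one, but the update rule is the same shape.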

47 Alpha 21264: ICache/fetch. 64KB, 2-way, 16-byte lines (4 instructions). Each line also contains extra information: the instructions, next line, next way, and pre-decoded bits. It incorporates the BTB and parts of instruction decode. BTB data is protected by 2 bits of hysteresis, trained by the branch predictor. Branch prediction is aggressive, to find parallelism and exploit speculative out-of-order execution -- we want lots of instructions in flight. On a miss, it prefetches up to 64 instructions.

48 Alpha 21264 pipeline stages: Fetch, Slot, Rename, Issue, Reg Read, Execute, Memory. Fetch: the 64KB 2-way L1I (an "enriched" Icache) with next line/set prediction, alongside the branch predictor. Rename: separate integer and FP register rename; Rename through Execute form the out-of-order portion of the machine. Issue: a 20-entry integer IQ and a 15-entry FP IQ. Execute: two integer clusters, each with a full 80-entry copy of the integer register file and two ALUs; the FP side has a 72-entry register file feeding an FP multiply unit and an FP add unit. Memory: a dual-ported 64KB 2-way L1D backed by the L2.

53 How Much Parallelism is There? Not much, in the presence of WAW and WAR dependences. These arise because we must reuse registers, and there are a limited number we can freely reuse. How can we get rid of them?

54 Removing False Dependences. WAW and WAR dependences arise because we have too few registers, so let's add more! But we can't: the architecture only gives us 32 (why, oh why, did we only use 5 bits?). Solution: define a set of internal physical registers as large as the number of instructions that can be in flight. Every instruction in the pipeline gets a register, and a register map table determines which physical register currently holds the value of each architectural register. This is called Register Renaming.

55 Alpha 21264: Renaming. Separate INT and FP renaming. Replaces architectural registers with physical registers: 80 integer physical registers, 72 FP physical registers. Eliminates WAW and WAR hazards. The register map table maintains the mapping between architectural and physical registers -- one copy for each in-flight instruction (80 copies). Special handling for conditional moves.

56 Alpha 21264: Renaming. Two parts: a content-addressable lookup to find the physical registers for the inputs, and register allocation to rename the output. Four instructions can be renamed each cycle: 8 ports on the lookup table, 4 allocations per cycle. There is no fixed location for architectural register values! How can we read architectural register r10?

57 Alpha 21264: Renaming example. Start from the map r1->p1, r2->p2, r3->p3 and rename:
1: Add r3, r2, r3  ->  Add p4, p2, p3   (map after: p1 p2 p4)
2: Sub r2, r1, r3  ->  Sub p5, p1, p4   (map after: p1 p5 p4)
3: Mult r1, r3, r1 ->  Mult p6, p4, p1  (map after: p6 p5 p4)
4: Add r2, r3, r1  ->  Add p7, p4, p6   (map after: p6 p7 p4)
5: Add r2, r1, r3  ->  Add p8, p6, p4   (map after: p6 p8 p4)
One copy of the map table is kept per instruction. After renaming, only the RAW dependences remain; the WAW and WAR hazards are gone.
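The renaming walkthrough above can be reproduced in a few lines (a sketch with made-up helper names; the real hardware does the source lookup content-addressably and recycles registers from a free list):

```python
def rename(instrs, arch_regs=("r1", "r2", "r3")):
    """Rename (op, dest, src1, src2) instructions, allocating a fresh
    physical register for every destination write."""
    map_table = {r: f"p{i+1}" for i, r in enumerate(arch_regs)}
    free = iter(f"p{i}" for i in range(len(arch_regs) + 1, 100))
    out = []
    for op, d, s1, s2 in instrs:
        ps1, ps2 = map_table[s1], map_table[s2]  # look up sources first...
        map_table[d] = next(free)                # ...then allocate the dest
        out.append((op, map_table[d], ps1, ps2))
    return out

code = [("Add","r3","r2","r3"), ("Sub","r2","r1","r3"),
        ("Mult","r1","r3","r1"), ("Add","r2","r3","r1"),
        ("Add","r2","r1","r3")]
for line in rename(code):
    print(line)   # ('Add', 'p4', 'p2', 'p3') ... ('Add', 'p8', 'p6', 'p4')
```

The ordering inside the loop matters: sources must be looked up before the destination is remapped, or an instruction like `Add r3, r2, r3` would wrongly read its own new register.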

64 Alpha 21264: Issue Queue. Separate Int and FP queues. Decouples the front and back ends. Dynamically tracks dependences: an instruction can issue once its input registers have been written; register status is tracked in the register scoreboard. Issues instructions around long-latency operations, exploiting cross-loop parallelism. Issues up to 4 integer and 2 floating point instructions per cycle, oldest first. Compacts the queue (so the free slots are always mostly at the top).

65 Alpha 21264: Issue Queue example. The renamed instructions enter the queue:
1: Add p4, p2, p3
2: Sub p5, p1, p4
3: Mult p6, p4, p1
4: Add p7, p4, p6
5: Add p8, p6, p4
With p1-p3 marked valid in the register scoreboard, only instruction 1 is ready, and it issues alone. Writing p4 wakes instructions 2 and 3, which issue together on the two ALUs; writing p5 and p6 then wakes 4 and 5, which issue together in the third cycle.
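The scoreboard-driven issue loop for the renamed code above can be sketched as follows (2 ALUs, 1-cycle latency, oldest-first -- all simplifications of the real machine):

```python
# Renamed instructions from the example: id -> (dest, src1, src2).
window = {1: ("p4","p2","p3"), 2: ("p5","p1","p4"),
          3: ("p6","p4","p1"), 4: ("p7","p4","p6"),
          5: ("p8","p6","p4")}
ready = {"p1", "p2", "p3"}     # register scoreboard: p1-p3 hold valid values

schedule = []
while window:
    # Oldest-first: pick up to 2 instructions whose sources are both ready.
    issuable = [i for i in sorted(window)
                if {window[i][1], window[i][2]} <= ready][:2]
    schedule.append(issuable)
    for i in issuable:          # 1-cycle latency: results ready next cycle
        ready.add(window.pop(i)[0])

print(schedule)   # [[1], [2, 3], [4, 5]] -- issue takes three cycles
```

Instruction 1 gates everything through p4, so no width greater than 2 would help this particular sequence.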

73 The Issue Window (one entry). Each entry holds the decoded instruction data: the opcode, the source register names (vrs, vrt), and the source values with valid bits. Comparators match the ALU output destination tags (alu_out_dst_0, alu_out_dst_1) against the sources; on a match, the broadcast value (alu_out_value_0/1) is captured and the valid bit is set. When both sources are valid, the entry is Ready.

74 The Issue Window. Ready instructions arbitrate for the ALUs; the arbiter picks which instructions go to ALU0 and ALU1 each cycle.

75 Alpha 21264: Execution. The integer ALUs are clustered: two ALUs share a complete replica of the integer register file, with 1 cycle of extra latency for cross-cluster updates (not a big performance hit). The issue queue can issue any instruction to either cluster, and critical paths tend to stay in one cluster. Area savings: register file area is quadratic in the number of ports. Each replica needs 4 read and 4 write ports (2 local writes, 2 remote), versus 8 read and 4 write ports unclustered -- O(2*8^2) vs O(12^2). It is simpler, too. This is the beginning of the slow wires problem.

76 Alpha 21264: Memory Interface. Memory is king!!! One of Alpha's niche markets was large, memory-intensive applications; they went 64 bits for the physical address space as much as for the virtual. Lots of outstanding requests: 32 loads and 32 stores (D only), plus 8 cache misses (I + D). Big caches (64KB, 2-way) -- what does Patterson's rule of thumb say? 2 loads/stores per cycle, double-pumped instead of multi-ported (area vs clock rate). Virtually-indexed, physically tagged. An 8-entry victim buffer shared between L1I and L1D.

77 Alpha 21264: Memory interface. Memory ordering: renames memory locations. LDQ/STQ: 32 entries each, sorted in fetch order (though entries arrive out of order). Instructions remain in the queues until retirement. Stores watch for younger loads to the same address; if a match occurs, the load and all subsequent instructions are squashed. Stores also watch for younger stores. Speculative loads get speculative data from the speculative store data buffer.

79 Alpha 21264: Retirement. Instructions retire in-order. At retirement, stores write to memory and renamed registers are released: each instruction carries the physical register number that held the previous value of its architectural destination register, and since retirement is in-order, that register is dead by then. On an exception, all younger instructions are squashed and the register map reverts to its state before the exception.

80 Alpha 21264: Memory interface. Ordering violations. Source order: ST r0, 0(r10); LD r1, 0(r11). Execution order: LD r1, 0(r11); ST r0, 0(r10). If r11 == r10 => violation, pipe flush. The load is marked as delayed; in the future it will wait for all previous stores. The delayed flags are cleared every 16,384 cycles.
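The violation check can be sketched as follows (a made-up structure; the real LDQ/STQ are 32-entry CAMs searched in hardware):

```python
def store_executes(store_addr, store_seq, load_queue):
    """When a store computes its address, find younger loads to the same
    address that have already executed -- each is an ordering violation
    that forces a squash and replay.
    load_queue entries are (seq, addr, executed) tuples, seq = fetch order."""
    return [seq for seq, addr, executed in load_queue
            if seq > store_seq and addr == store_addr and executed]

# ST r0, 0(r10) is older (seq 10); LD r1, 0(r11) ran early (seq 11).
loads = [(11, 0x1000, True)]
print(store_executes(0x1000, 10, loads))   # [11] -> squash and replay
print(store_executes(0x2000, 10, loads))   # []   -> different address, OK
```

The "delayed" bit mentioned above is the fix for loads that trip this check repeatedly: once marked, such a load simply waits for all older stores instead of speculating.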

81 Alpha 21264: Memory Interface. Speculative cache hits (integer only). The instruction queue assumes loads hit in the L1. When they don't, do a mini-restart: up to 8 instructions are pulled back into the issue queue to be reissued, resulting in a 2-cycle bubble. A single 4-bit predictor tracks the miss behavior.

82 Alpha 21364. Introduced at 1.2 GHz. 0.18 micron, 130M transistors, 400mm2 die. 1.75MB on-chip L2. Essentially a 21264 with an on-chip L2 cache.

85 Performance progression: 300MHz brought a 1.7x improvement; 600MHz brought a further 1.8x improvement. Overall: a 27.8x improvement -- an 8.3x cycle time improvement and 3.5x from architecture.

88 Modern OOO Processors. The fastest machines in the world are OOO superscalars. AMD Barcelona: 6-wide issue, 106 instructions in flight at once. Intel Nehalem: 5-way issue to 12 ALUs, >128 instructions in flight. OOO provides the most benefit for memory operations: non-dependent instructions can keep executing during cache misses. This is so-called memory-level parallelism, and it is enormously important -- CPU performance is (almost) all about memory performance nowadays (remember the memory wall graphs!).

89 The Problem with OOO. Even the fastest OOO machines only get about 1-2 IPC, even though they are 4-5 wide. Problems: insufficient ILP within applications (per thread, usually); poor branch prediction performance; single threads also have little memory parallelism. Observation: on many cycles, many ALUs and instruction queue slots sit empty.

90 Simultaneous Multithreading. AKA HyperThreading in Intel machines. Run multiple threads at the same time: throw all the instructions into the pipeline, keeping some data separate for each thread -- the renaming table, TLB entries, and PCs -- while the rest of the hardware is shared. It is surprisingly simple (but still quite complicated). Per-thread fetch units (T1-T4) feed the shared Decode, Rename, Schedule, EX, Mem, and Writeback stages.

91 SMT. Advantages: exploit the ILP of multiple threads at once; less dependence on branch prediction (fewer correct predictions required per thread); less idle hardware (increased power efficiency); much higher IPC -- up to 4 (in simulation). Disadvantages: threads can fight over resources and slow each other down. Historical footnote: invented, in part, by our own Dean Tullsen when he was at UW.

92 Keeping the Window Filled. Keeping the instruction window filled is key! Instruction windows are about 32 instructions (size is limited by their complexity, which is considerable). Branches occur every 4-5 instructions, so the processor must predict 6-8 consecutive branches correctly to keep the window full. On a mispredict, you flush the pipeline, which includes emptying the window.
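The arithmetic behind that claim, with illustrative per-branch accuracies:

```python
# With a branch every ~4.5 instructions, a 32-entry window spans
# about 7 branches, and all of them must be predicted correctly.
window, insts_per_branch = 32, 4.5
branches = round(window / insts_per_branch)     # ~7 branches in the window

for accuracy in (0.90, 0.95, 0.99):
    p_full = accuracy ** branches               # all 7 predicted correctly
    print(f"{accuracy:.0%} per-branch -> {p_full:.0%} chance window is full")
```

Even 95% per-branch accuracy keeps the window full only about 70% of the time, which is why aggressive branch prediction (like the 21264's tournament predictor) matters so much to OOO machines.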


More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

Lecture 9: Multiple Issue (Superscalar and VLIW)

Lecture 9: Multiple Issue (Superscalar and VLIW) Lecture 9: Multiple Issue (Superscalar and VLIW) Iakovos Mavroidis Computer Science Department University of Crete Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order

More information

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley Computer Systems Architecture I CSE 560M Lecture 10 Prof. Patrick Crowley Plan for Today Questions Dynamic Execution III discussion Multiple Issue Static multiple issue (+ examples) Dynamic multiple issue

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Getting CPI under 1: Outline

Getting CPI under 1: Outline CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Instruction-Level Parallelism and Its Exploitation

Instruction-Level Parallelism and Its Exploitation Chapter 2 Instruction-Level Parallelism and Its Exploitation 1 Overview Instruction level parallelism Dynamic Scheduling Techniques es Scoreboarding Tomasulo s s Algorithm Reducing Branch Cost with Dynamic

More information

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3) Hardware Speculation and Precise

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

CPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation

CPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Instruction

More information

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović Computer Architecture and Engineering CS52 Quiz #3 March 22nd, 202 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 0 Pages Notes: Not all questions are

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Chapter 4 The Processor 1. Chapter 4D. The Processor

Chapter 4 The Processor 1. Chapter 4D. The Processor Chapter 4 The Processor 1 Chapter 4D The Processor Chapter 4 The Processor 2 Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline

More information

CSE 240A Midterm Exam

CSE 240A Midterm Exam Student ID Page 1 of 7 2011 Fall Professor Steven Swanson CSE 240A Midterm Exam Please write your name at the top of each page This is a close book, closed notes exam. No outside material may be used.

More information

CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming

CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming CS 152 Computer Architecture and Engineering Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming John Wawrzynek Electrical Engineering and Computer Sciences University of California at

More information

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Tomasulo

More information

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown

More information

Course on Advanced Computer Architectures

Course on Advanced Computer Architectures Surname (Cognome) Name (Nome) POLIMI ID Number Signature (Firma) SOLUTION Politecnico di Milano, July 9, 2018 Course on Advanced Computer Architectures Prof. D. Sciuto, Prof. C. Silvano EX1 EX2 EX3 Q1

More information

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

Lecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ

Lecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ 1 An Out-of-Order Processor Implementation Reorder Buffer (ROB)

More information

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

EITF20: Computer Architecture Part3.2.1: Pipeline - 3 EITF20: Computer Architecture Part3.2.1: Pipeline - 3 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Dynamic scheduling - Tomasulo Superscalar, VLIW Speculation ILP limitations What we have done

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

E0-243: Computer Architecture

E0-243: Computer Architecture E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation

More information

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士 Computer Architecture 计算机体系结构 Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2018 Review Hazards (data/name/control) RAW, WAR, WAW hazards Different types

More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage

More information

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections ) Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections 2.3-2.6) 1 Correlating Predictors Basic branch prediction: maintain a 2-bit saturating counter for each

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation 1 Branch Prediction Basic 2-bit predictor: For each branch: Predict taken or not

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Four Steps of Speculative Tomasulo cycle 0

Four Steps of Speculative Tomasulo cycle 0 HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly

More information

HANDLING MEMORY OPS. Dynamically Scheduling Memory Ops. Loads and Stores. Loads and Stores. Loads and Stores. Memory Forwarding

HANDLING MEMORY OPS. Dynamically Scheduling Memory Ops. Loads and Stores. Loads and Stores. Loads and Stores. Memory Forwarding HANDLING MEMORY OPS 9 Dynamically Scheduling Memory Ops Compilers must schedule memory ops conservatively Options for hardware: Hold loads until all prior stores execute (conservative) Execute loads as

More information

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 15: Instruction Level Parallelism and Dynamic Execution March 11, 2002 Prof. David E. Culler Computer Science 252 Spring 2002

More information

EECS 452 Lecture 9 TLP Thread-Level Parallelism

EECS 452 Lecture 9 TLP Thread-Level Parallelism EECS 452 Lecture 9 TLP Thread-Level Parallelism Instructor: Gokhan Memik EECS Dept., Northwestern University The lecture is adapted from slides by Iris Bahar (Brown), James Hoe (CMU), and John Shen (CMU

More information

Itanium 2 Processor Microarchitecture Overview

Itanium 2 Processor Microarchitecture Overview Itanium 2 Processor Microarchitecture Overview Don Soltis, Mark Gibson Cameron McNairy, August 2002 Block Diagram F 16KB L1 I-cache Instr 2 Instr 1 Instr 0 M/A M/A M/A M/A I/A Template I/A B B 2 FMACs

More information

Multi-cycle Instructions in the Pipeline (Floating Point)

Multi-cycle Instructions in the Pipeline (Floating Point) Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining

More information

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers) physical register file that is the same size as the architectural registers

More information

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest

More information

Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.

Module 5: MIPS R10000: A Case Study Lecture 9: MIPS R10000: A Case Study MIPS R A case study in modern microarchitecture. Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R10000 A case study in modern microarchitecture Overview Stage 1: Fetch Stage 2: Decode/Rename Branch prediction Branch

More information

Multithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others

Multithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others Multithreading Processors and Static Optimization Review Adapted from Bhuyan, Patterson, Eggers, probably others Schedule of things to do By Wednesday the 9 th at 9pm Please send a milestone report (as

More information

Lecture 18: Instruction Level Parallelism -- Dynamic Superscalar, Advanced Techniques,

Lecture 18: Instruction Level Parallelism -- Dynamic Superscalar, Advanced Techniques, Lecture 18: Instruction Level Parallelism -- Dynamic Superscalar, Advanced Techniques, ARM Cortex-A53, and Intel Core i7 CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

More information

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls

More information

A Key Theme of CIS 371: Parallelism. CIS 371 Computer Organization and Design. Readings. This Unit: (In-Order) Superscalar Pipelines

A Key Theme of CIS 371: Parallelism. CIS 371 Computer Organization and Design. Readings. This Unit: (In-Order) Superscalar Pipelines A Key Theme of CIS 371: arallelism CIS 371 Computer Organization and Design Unit 10: Superscalar ipelines reviously: pipeline-level parallelism Work on execute of one instruction in parallel with decode

More information

Dynamic Issue & HW Speculation. Raising the IPC Ceiling

Dynamic Issue & HW Speculation. Raising the IPC Ceiling Dynamic Issue & HW Speculation Today s topics: Superscalar pipelines Dynamic Issue Scoreboarding: control centric approach Tomasulo: data centric approach 1 CS6810 Raising the IPC Ceiling w/ single-issue

More information

Page 1. Raising the IPC Ceiling. Dynamic Issue & HW Speculation. Fix OOO Completion Problem First. Reorder Buffer In Action

Page 1. Raising the IPC Ceiling. Dynamic Issue & HW Speculation. Fix OOO Completion Problem First. Reorder Buffer In Action Raising the IPC Ceiling Dynamic Issue & HW Speculation Today s topics: Superscalar pipelines Dynamic Issue Scoreboarding: control centric approach Tomasulo: data centric approach w/ single-issue IPC max

More information

Lecture 9: Dynamic ILP. Topics: out-of-order processors (Sections )

Lecture 9: Dynamic ILP. Topics: out-of-order processors (Sections ) Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections 2.3-2.6) 1 An Out-of-Order Processor Implementation Reorder Buffer (ROB) Branch prediction and instr fetch R1 R1+R2 R2 R1+R3 BEQZ R2 R3

More information

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Instruction-Level Parallelism and its Exploitation: PART 1 ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Project and Case

More information

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) 18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures

More information

Super Scalar. Kalyan Basu March 21,

Super Scalar. Kalyan Basu March 21, Super Scalar Kalyan Basu basu@cse.uta.edu March 21, 2007 1 Super scalar Pipelines A pipeline that can complete more than 1 instruction per cycle is called a super scalar pipeline. We know how to build

More information

TDT 4260 lecture 7 spring semester 2015

TDT 4260 lecture 7 spring semester 2015 1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding

More information

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011 5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr

More information

Static & Dynamic Instruction Scheduling

Static & Dynamic Instruction Scheduling CS3014: Concurrent Systems Static & Dynamic Instruction Scheduling Slides originally developed by Drew Hilton, Amir Roth, Milo Martin and Joe Devietti at University of Pennsylvania 1 Instruction Scheduling

More information

Instruction Level Parallelism

Instruction Level Parallelism Instruction Level Parallelism The potential overlap among instruction execution is called Instruction Level Parallelism (ILP) since instructions can be executed in parallel. There are mainly two approaches

More information

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model. Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

More information

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis6627 Powerpoint Lecture Notes from John Hennessy

More information

LECTURE 3: THE PROCESSOR

LECTURE 3: THE PROCESSOR LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU

More information

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors William Stallings Computer Organization and Architecture 8 th Edition Chapter 14 Instruction Level Parallelism and Superscalar Processors What is Superscalar? Common instructions (arithmetic, load/store,

More information

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU 1-6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU Product Overview Introduction 1. ARCHITECTURE OVERVIEW The Cyrix 6x86 CPU is a leader in the sixth generation of high

More information