OOO Execution and 21264
- Steven Rich
1 OOO Execution and the Alpha 21264
2-3 Parallelism ET = IC * CPI * CT. IC is more or less fixed. We have shrunk cycle time as far as we can. We have achieved a CPI of 1. Can we get faster? We can reduce our CPI to less than 1. The processor must do multiple operations at once. This is called Instruction-Level Parallelism (ILP).
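The iron law above can be worked through with a couple of lines of arithmetic. This is a sketch with made-up numbers (the instruction count and cycle time are hypothetical), showing why a CPI below 1 is the remaining lever once IC and CT are fixed.

```python
# ET = IC * CPI * CT, with illustrative (made-up) numbers.

def execution_time(ic, cpi, ct_ns):
    """Execution time in ns: instruction count * cycles-per-instruction * cycle time."""
    return ic * cpi * ct_ns

ic = 1_000_000   # hypothetical instruction count
ct = 0.5         # hypothetical cycle time in ns (a 2 GHz clock)

baseline = execution_time(ic, 1.0, ct)     # the pipelined ideal: CPI == 1
superscalar = execution_time(ic, 0.5, ct)  # two instructions per cycle

print(baseline, superscalar)  # 500000.0 250000.0
```

Halving CPI halves execution time, exactly as the formula predicts.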
4 The Basic 5-stage Pipeline Fetch Decode EX Mem Writeback. Like an assembly line -- instructions move through in lock step. In the best case, it can achieve one instruction per cycle (an IPC of 1). In practice, it's much worse -- branches, data hazards, and long-latency memory operations cause much lower IPC. We want an IPC > 1!!!
5 Approach 1: Widen the pipeline (Diagram: two parallel pipelines -- Fetch pulls two instructions, at PC and PC+4; Decode reads 4 register values; two EX units; only the upper pipe has a Mem stage.) Process two instructions at once instead of 1. Often 1 odd-PC instruction and 1 even-PC instruction; this keeps the instruction fetch logic simpler. This is a 2-wide, in-order, superscalar processor. Potential problems?
6-7 Dual issue: Structural Hazards We might not replicate everything -- perhaps only one multiplier, one shifter, and one load/store unit. What if the instruction is in the wrong place? (Diagram: the dual pipeline, with the multiplier in the upper EX and the shifter in the lower.) If an upper instruction needs the lower pipeline, squash the lower instruction.
8 Dual issue: Data Hazards The lower instruction may need a value produced by the upper instruction. Forwarding cannot help us -- we must stall. (Diagram: the dual pipeline with the dependent lower instruction stalled.)
9 Compiling for Dual Issue The compiler should: Pair up non-conflicting instructions. Align branch targets (by potentially inserting noops above them). These are similar to the rules for VLIW, but they are just guidelines, not rules.
10 Beyond Dual Issue Wider pipelines are possible. There is often a separate floating point pipeline. Wide issue leads to hardware complexity. Compiling gets harder, too. In practice, processors use one of two options if they want more ILP: If we can change the ISA: VLIW. If we can't: Out-of-order.
11-13 Going Out of Order: Data dependence refresher. 1: add $t1,$s2,$s3 2: sub $t2,$s3,$s4 3: or $t5,$t1,$t2 4: add $t3,$t1,$t2 (Diagram: the dependence graph over instructions 1-4.) There is parallelism!! We can execute 1 & 2 at once and 3 & 4 at once. We can parallelize instructions that do not have a read-after-write (RAW) dependence.
14-15 Data dependences In general, if there is no dependence between two instructions, we can execute them in either order or simultaneously. But beware: Is there a dependence here? 1: add $t1,$s2,$s3 2: sub $t1,$s3,$s4 Can we reorder the instructions? 2: sub $t1,$s3,$s4 1: add $t1,$s2,$s3 Is the result the same? No! The final value of $t1 is different.
16 False Dependence #1 Also called Write-after-Write (WAW) dependences; these occur when two instructions write to the same register. The dependence is false because no data flows between the instructions -- they just produce an output with the same name.
17-18 Beware again! Is there a dependence here? 1: add $t1,$s2,$s3 2: sub $s2,$s3,$s4 Can we reorder the instructions? 2: sub $s2,$s3,$s4 1: add $t1,$s2,$s3 Is the result the same? No! The value in $s2 that 1 needs will be destroyed.
19 False Dependence #2 This is a Write-after-Read (WAR) dependence. Again, it is false because no data flows between the instructions.
20 Out-of-Order Execution Any sequence of instructions has a set of RAW, WAW, and WAR hazards that constrain its execution. Can we design a processor that extracts as much parallelism as possible, while still respecting these dependences?
21 The Central OOO Idea 1. Fetch a bunch of instructions 2. Build the dependence graph 3. Find all instructions with no unmet dependences 4. Execute them. 5. Repeat
22-29 Example 1: add $t1,$s2,$s3 2: sub $t2,$s3,$s4 3: or $t3,$t1,$t2 4: add $t5,$t1,$t2 5: or $t4,$s1,$s3 6: mul $t2,$t3,$s5 7: sl $t3,$t4,$t2 8: add $t3,$t5,$t1 (Diagram: the dependence graph, built edge by edge, with RAW, WAW, and WAR edges marked.) 8 instructions in 5 cycles.
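The central OOO loop applied to the example above can be sketched in a few lines: build the RAW/WAW/WAR dependence graph over the window, then repeatedly execute every instruction whose dependences are all met. This is a sketch of the idea, not the 21264's actual hardware mechanism; instructions are modeled as (dest, src1, src2) register-name tuples.

```python
# Build the full dependence graph (RAW, WAW, WAR) for a window of
# instructions, then schedule greedily: each cycle, run everything whose
# dependences have all completed.

def dependences(insts):
    deps = {j: set() for j in range(len(insts))}
    for j, (dst_j, *srcs_j) in enumerate(insts):
        for i in range(j):
            dst_i, *srcs_i = insts[i]
            if dst_i in srcs_j:   # RAW: j reads what i wrote
                deps[j].add(i)
            if dst_i == dst_j:    # WAW: both write the same register
                deps[j].add(i)
            if dst_j in srcs_i:   # WAR: j overwrites what i reads
                deps[j].add(i)
    return deps

def schedule(insts):
    deps, done, cycles = dependences(insts), set(), []
    while len(done) < len(insts):
        ready = [i for i in range(len(insts)) if i not in done and deps[i] <= done]
        done |= set(ready)
        cycles.append([i + 1 for i in ready])  # 1-indexed, as on the slides
    return cycles

insts = [("t1", "s2", "s3"),  # 1: add $t1,$s2,$s3
         ("t2", "s3", "s4"),  # 2: sub $t2,$s3,$s4
         ("t3", "t1", "t2"),  # 3: or  $t3,$t1,$t2
         ("t5", "t1", "t2"),  # 4: add $t5,$t1,$t2
         ("t4", "s1", "s3"),  # 5: or  $t4,$s1,$s3
         ("t2", "t3", "s5"),  # 6: mul $t2,$t3,$s5
         ("t3", "t4", "t2"),  # 7: sl  $t3,$t4,$t2
         ("t3", "t5", "t1")]  # 8: add $t3,$t5,$t1
print(schedule(insts))  # [[1, 2, 5], [3, 4], [6], [7], [8]] -- 5 cycles
```

The WAW and WAR edges (e.g. instruction 6's write of $t2 while 3 and 4 still read it) are what force 6, 7, and 8 to run serially; removing those false dependences is exactly what renaming, later in the lecture, buys us.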
30 Simplified OOO Pipeline A new schedule stage manages the Instruction Window. The window holds the set of instructions the processor examines. Fetch and decode fill the window; the execute stage drains it. Typically, OOO pipelines are also wide, but it is not necessary. Impacts: more forwarding, more stalls, longer branch resolution -- fundamentally more work per instruction. (Pipeline: Fetch Decode Schedule EX Mem Writeback)
31 The Instruction Window The Instruction Window is the set of instructions the processor examines. Fetch and decode fill the window; the execute stage drains it. The larger the window, the more parallelism the processor can find, but... keeping the window filled is a challenge.
32 Case Study: Alpha 21264
33 Digital Equipment Corporation One of the Big Old Computer companies (along with IBM). Business-oriented computers. Check out Gordon Bell's lecture in the History of Computing class. They produced a string of famous machines. Sold to Compaq in 1998; sold to HP (and Intel) in 2002.
34 The PDPs Most famous: the PDP-11. Birthplace of UNIX. Elegant ISA. Designed by a small team in short order, in response to a competitor formed by defecting engineers. 16 bits of virtual address (the PDP-5 and PDP-8 were 12 bits). Chronically short of address bits. Sold until 1997.
35 The VAX (In)famous and long-lived. VAX stands for "Virtual Address Extension" (to the PDP-11). LOTS of extensions. Very CISCy -- a polynomial-evaluate instruction, etc.
36 The Alpha Four processors: 21064, 21164, 21264, 21364, (21464). "21" for the 21st century; "64" for 64-bit. High-end workstations/servers. Fastest processors in the world at introduction. Ran Unix, VMS (the old VAX OS), WindowsNT, and Linux. Alpha died when Intel bought the IP and the design team.
37 AlphaAXP A new ISA from scratch. No legacy anything (almost -- a VAX-style floating point mode). 64-bit. Very clean RISC ISA: Register-Register/Load-Store. No condition codes. Conditional moves -- reduced branching, but at what cost? 32 GPRs and 32 FPRs. OS support: PALcode -- firmware control of low-level hardware. VAX compatibility provided in software: VAX ISA -> Alpha via a compiler.
38 Alpha 21064 Introduced in 1992 at 150 Mhz (blazingly fast at the time). 750nm/0.75 micron (vs 45nm today). 234mm2 die, 1.6M transistors. 33 Watts. Full custom design.
39 Alpha 21064 (cont) Pipeline: Dual issue. 7-stage integer/10-stage FP. 4-cycle mis-prediction penalty. 45 bypassing paths. 22 instructions in flight. Caches: On-chip L1I + L1D, 8KB each. Off-chip L2. Branch prediction: Static: forward taken/backward not taken. Simple dynamic prediction, 80% accuracy.
40 Alpha 21164 Introduced in 1995 at 300 Mhz. 500nm/0.5 micron. 299mm2 die, 9.7M transistors. 56W.
41 Alpha 21164 (cont) Pipeline: Quad issue: 2 integer + 2 FP. 7-stage integer/10-stage FP. Caches: On-chip L1I + L1D, 8KB each, direct-mapped (fast!). Hit under miss/miss under miss (21 outstanding at once). On-chip 3-way 96KB L2. Off-chip L3 (1-64MB). ISA changes: Native support for byte operations. Branch prediction: 5-cycle mispredict penalty. History-based dynamic predictor; bits stored per cache line.
42 Alpha 21264 Introduced in 1998. 500 Mhz-1.2Ghz. 0.35 micron. 314mm2 die, 15.2M transistors. 73W.
43 Alpha 21264 (cont) Pipeline: 6-issue: 4 integer + 2 FP. 7-stage integer/longer for FP, depending on op. 80 in-flight instructions. Caches: On-chip L1I + L1D, 64KB each, 2-way. Off-chip L2. Compared to the 21164: 8x the L1 capacity, but no on-chip L2.
44 Aggressive Speculation The 21264 executes instructions that may or may not be on the correct path. When it's wrong, it has to undo those instructions. It stores backups of renaming tables, the register file, etc. It also must prevent changes to memory from occurring until the instructions commit.
45 In-Order Fetch and Commit Fetch is in-order. Execution is out of order: extract as much parallelism as possible. Commit is in-order: make the changes permanent in program order. This is what is visible to the programmer. This enables precise exceptions (mostly).
46 Alpha 21264 (cont) Fetch unit: Pre-decodes instructions in the Icache. Next-line and set predictors (correct most of the time). Tournament branch predictor: a local history predictor + a global history predictor + a third predictor that tracks which one is most effective. 2 cycles to make a prediction.
47 Alpha 21264: I Cache/Fetch 64KB, 2-way, 16-byte lines (4 instructions). Each line also contains extra information: Instructions, Next Line, Next Way, Pre-decoded bits. Incorporates the BTB and parts of instruction decode. BTB data is protected by 2 bits of hysteresis, trained by the branch predictor. Branch prediction is aggressive to find parallelism and exploit speculative out-of-order execution. We want lots of instructions in flight. On a miss, it prefetches up to 64 instructions.
48-52 Alpha 21264 pipeline (Diagram: Fetch -> Slot -> Rename -> Issue -> Reg Read -> Execute -> Memory; the Rename-through-Memory stages are out-of-order. The branch predictor and next line/set prediction feed an enriched L1 Icache (64KB, 2-way). Int register rename feeds a 20-entry Int IQ; two execution clusters, each with an 80-entry Int Reg File replica and two ALUs. FP register rename feeds a 15-entry FP IQ, a 72-entry FP Reg File, and FP Mult/FP Add units. A dual-ported L1D (64KB, 2-way) backed by a 96KB 3-way L2.)
53 How Much Parallelism is There? Not much, in the presence of WAW and WAR dependences. These arise because we must reuse registers, and there are a limited number we can freely reuse. How can we get rid of them?
54 Removing False Dependences If WAW and WAR dependences arise because we have too few registers, let's add more! But we can't: the architecture only gives us 32 (why oh why did we only use 5 bits?). Solution: Define a set of internal physical registers as large as the number of instructions that can be in flight, as in the latest Intel chips. Every instruction in the pipeline gets a register. Maintain a register mapping table that determines which physical register currently holds the value for each architectural register. This is called Register Renaming.
55 Alpha 21264: Renaming Separate INT and FP. Replaces architectural registers with physical registers: 80 integer physical registers, 72 FP physical registers. Eliminates WAW and WAR hazards. The register map table maintains the mapping between architectural and physical registers -- one copy for each in-flight instruction (80 copies). Special handling for conditional moves.
56 Alpha 21264: Renaming Two parts: Content-addressable lookup to find physical register inputs. Register allocation to rename the output. Four instructions can be renamed each cycle: 8 ports on the lookup table, 4 allocations per cycle. There is no fixed location for architectural register values! How can we read architectural register r10?
57-63 Alpha 21264: Renaming example 1: Add r3, r2, r3 -> p4, p2, p3 2: Sub r2, r1, r3 -> p5, p1, p4 3: Mult r1, r3, r1 -> p6, p4, p1 4: Add r2, r3, r1 -> p7, p4, p6 5: Add r2, r1, r3 -> p8, p6, p4 Register map table (r1 r2 r3): 0: p1 p2 p3; 1: p1 p2 p4; 2: p1 p5 p4; 3: p6 p5 p4; 4: p6 p7 p4; 5: p6 p8 p4. The RAW dependences survive; the WAW and WAR dependences disappear.
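The renaming walkthrough above can be sketched directly: look each source up in the current map table, then allocate a fresh physical register for the destination. This is a sketch of the mapping logic only (the register names and the renaming sequence match the slides; the real 21264 does this with CAMs, not a dictionary).

```python
# Register renaming sketch: each destination gets a fresh physical register,
# which removes WAW and WAR hazards while preserving RAW dataflow.

from itertools import count

def rename(insts, initial_map, first_free):
    """insts: list of (dest, src1, src2) architectural register names."""
    rmap = dict(initial_map)
    free = count(first_free)           # fresh physical registers p4, p5, ...
    renamed = []
    for dst, s1, s2 in insts:
        srcs = (rmap[s1], rmap[s2])    # read sources under the current map
        rmap[dst] = f"p{next(free)}"   # allocate a new physical destination
        renamed.append((rmap[dst], *srcs))
    return renamed, rmap

insts = [("r3", "r2", "r3"),  # 1: Add r3, r2, r3
         ("r2", "r1", "r3"),  # 2: Sub r2, r1, r3
         ("r1", "r3", "r1"),  # 3: Mult r1, r3, r1
         ("r2", "r3", "r1"),  # 4: Add r2, r3, r1
         ("r2", "r1", "r3")]  # 5: Add r2, r1, r3

renamed, final_map = rename(insts, {"r1": "p1", "r2": "p2", "r3": "p3"}, 4)
print(renamed)
# [('p4', 'p2', 'p3'), ('p5', 'p1', 'p4'), ('p6', 'p4', 'p1'),
#  ('p7', 'p4', 'p6'), ('p8', 'p6', 'p4')]
print(final_map)  # {'r1': 'p6', 'r2': 'p8', 'r3': 'p4'}
```

Note that the three writes to r2 (instructions 2, 4, 5) land in three different physical registers (p5, p7, p8), so they no longer constrain each other's order.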
64 Alpha 21264: Issue Queue Separate Int and FP queues. Decouple the front and back ends. Dynamically track dependences: instructions can issue once their input registers are written. Track register status in a register scoreboard. Issue instructions around long-latency operations; exploit cross-loop parallelism. Issue up to 4 instructions/cycle (2 floating point). Issue oldest first. Compact the queue (the free slots are always mostly at the top).
65-72 Alpha 21264: Issue Queue example (Diagram: the five renamed instructions 1: Add p4,p2,p3; 2: Sub p5,p1,p4; 3: Mult p6,p4,p1; 4: Add p7,p4,p6; 5: Add p8,p6,p4 drain through two ALUs. The register scoreboard for p1-p8 starts with p1-p3 ready; instruction 1 issues first, marking p4 ready; then 2 and 3 issue; then 4 and 5.)
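The scoreboard-driven issue walked through above can be sketched as follows. This is a simplified model (assumed single-cycle latencies and a 2-ALU issue width), not the 21264's actual wakeup/select circuit; each cycle, the oldest instructions whose sources are marked ready issue, and their destinations become ready.

```python
# Oldest-first issue from a queue, gated by a register-ready scoreboard.

def issue_schedule(insts, ready_regs, width=2):
    """insts: (dest, src1, src2) physical-register tuples in program order."""
    queue = list(enumerate(insts, start=1))   # (number, instruction)
    ready = set(ready_regs)                   # the scoreboard
    cycles = []
    while queue:
        # Scan in age order, taking up to `width` instructions whose sources
        # are both ready this cycle.
        issued = [n for n, (d, s1, s2) in queue
                  if s1 in ready and s2 in ready][:width]
        cycles.append(issued)
        ready |= {d for n, (d, _, _) in queue if n in issued}
        queue = [e for e in queue if e[0] not in issued]
    return cycles

insts = [("p4", "p2", "p3"),  # 1: Add p4, p2, p3
         ("p5", "p1", "p4"),  # 2: Sub p5, p1, p4
         ("p6", "p4", "p1"),  # 3: Mult p6, p4, p1
         ("p7", "p4", "p6"),  # 4: Add p7, p4, p6
         ("p8", "p6", "p4")]  # 5: Add p8, p6, p4

print(issue_schedule(insts, {"p1", "p2", "p3"}))  # [[1], [2, 3], [4, 5]]
```

Instruction 1 issues alone (everything else waits on p4), then 2 and 3 together, then 4 and 5 -- matching the walkthrough.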
73 The Issue Window (Diagram: one issue-window entry holds the decoded instruction data -- opcode, virtual source registers vrs/vrt, and rs_value/rt_value with valid bits. Comparators match the entry's sources against the ALU output destinations alu_out_dst_0/alu_out_dst_1, latching the bypassed alu_out_value_0/alu_out_value_1; when both source values are valid, the entry is Ready.)
74 The Issue Window (Diagram: ready instructions arbitrate for ALU0 and ALU1.)
75 Alpha 21264: Execution Integer ALUs are clustered: two ALUs share a complete replica of the Int register file. 1 cycle extra latency for cross-cluster updates -- not a big performance hit. The issue queue can issue any instruction to either cluster; critical paths tend to stay in one cluster. Area savings: register file size is quadratic in the number of ports. Each replica needs 4 read and 4 write ports (2 local writes, 2 remote). Unclustered -> 8 read, 4 write ports. O(2*8^2) vs O(12^2). Simpler, too. This is the beginning of the slow wires problem.
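The quadratic port-count argument above works out as simple arithmetic. This is the slide's rough model (area proportional to ports squared), not an exact layout figure:

```python
# Rough register-file area model: area grows with the square of port count.
# Clustered 21264: two replicas, each 4 read + 4 write = 8 ports.
# Hypothetical unclustered design: one file with 8 read + 4 write = 12 ports.

def relative_area(replicas, ports):
    return replicas * ports ** 2

clustered = relative_area(2, 8)      # O(2*8^2)
unclustered = relative_area(1, 12)   # O(12^2)
print(clustered, unclustered)        # 128 144
```

So even with two full copies of the register file, the clustered design comes out smaller (128 vs 144 in this model), on top of being faster per port.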
76 Alpha 21264: Memory Interface Memory is king!!! One of Alpha's niche markets was large, memory-intensive applications. They went 64-bit for the physical address space as much as for the virtual. Lots of outstanding requests: 32 loads, 32 stores (D only), 8 cache misses (I + D). Big caches (64KB, 2-way) -- what does Patterson's thumb say? 2 loads/stores per cycle: double-pumped instead of multi-ported (area vs clock rate). Virtually-indexed, physically tagged. 8-entry victim buffer shared between L1I and L1D.
77 Alpha 21264: Memory interface Memory ordering: renames memory locations. LDQ/STQ: 32 entries each, sorted in fetch order (but entries arrive out of order). Instructions remain in the queues until retirement. Stores watch for younger loads to the same address that have already executed; squash the load and subsequent instructions if a match occurs. Stores also watch for ordering among stores. Speculative loads get speculative data from the speculative store data buffer.
79 Alpha 21264: Retirement Instructions retire in-order. At retirement: stores write to memory; renamed registers are released. Each instruction carries the physical register number that held the previous value of its architectural destination register. Since retirement is in-order, that register is dead by then. On exceptions, all younger instructions are squashed, and the register map reverts to its state before the exception.
80 Alpha 21264: Memory interface Ordering violations. Source order: ST r0, 0(r10); LD r1, 0(r11). Execution order: LD r1, 0(r11); ST r0, 0(r10). If r11 == r10 => violation, pipe flush. Mark the load as delayed: in the future, it will wait for all previous stores. Clear the delayed flag every 16,384 cycles.
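The violation check described above can be sketched as follows. This is a simplified model of the idea (one word per address, explicit sequence numbers), not the 21264's CAM-based load queue: when a store executes, any younger load to the same address that has already executed read stale data and must be squashed.

```python
# Detect memory-ordering violations: a load that ran before an older store
# to the same address.

def find_violations(fetch_order, exec_order):
    """fetch_order: (seq, kind, addr) in program order; exec_order: seqs in
    the order they actually executed. Returns loads that must be squashed."""
    info = {s: (k, a) for s, k, a in fetch_order}
    done, violations = [], []
    for seq in exec_order:
        kind, addr = info[seq]
        if kind == "ST":
            # Any younger (larger seq) load to this address already executed?
            for d in done:
                dk, da = info[d]
                if dk == "LD" and d > seq and da == addr:
                    violations.append(d)
        done.append(seq)
    return violations

# Source: ST r0,0(r10) then LD r1,0(r11); the load executed first.
fetch = [(1, "ST", 0x100), (2, "LD", 0x100)]      # r10 == r11: same address
print(find_violations(fetch, exec_order=[2, 1]))  # [2] -> squash the load
```

If the addresses differ, the reordering is harmless and the function returns no violations, which is why the 21264 can profitably speculate that loads and stores are independent.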
81 Alpha 21264: Memory Interface Speculative cache hits (integer only). The instruction queue assumes loads hit in the L1. When they don't hit, do a mini-restart: up to 8 instructions are pulled back into the issue queue to be reissued. Results in a 2-cycle bubble. A single 4-bit predictor tracks the miss behavior.
82 Alpha 21364 Introduced in 2003 at 1.2 Ghz. 0.18 micron, 130M transistors, ~400mm2 die. On-chip L2 cache (MB scale). Essentially a 21264 with an on-chip L2.
83-87 (Performance comparison figures) 300MHz: 1.7x improvement. 600MHz: 1.8x improvement. Overall: 27.8x improvement -- 8.3x from cycle time improvement, 3.5x from architecture.
88 Modern OOO Processors The fastest machines in the world are OOO superscalars. AMD Barcelona: 6-wide issue, 106 instructions in flight at once. Intel Nehalem: 5-way issue to 12 ALUs, > 128 instructions in flight. OOO provides the most benefit for memory operations: non-dependent instructions can keep executing during cache misses. This is so-called memory-level parallelism, and it is enormously important. CPU performance is (almost) all about memory performance nowadays (remember the memory wall graphs!).
89 The Problem with OOO Even the fastest OOO machines only get about 1-2 IPC, even though they are 4-5 wide. Problems: Insufficient ILP within applications (per thread, usually). Poor branch prediction performance. Single threads also have little memory parallelism. Observation: on many cycles, many ALUs and instruction queue slots sit empty.
90 Simultaneous Multithreading AKA HyperThreading in Intel machines. Run multiple threads at the same time: just throw all the instructions into the pipeline. Keep some separate data for each thread: the renaming table, TLB entries, PCs. But the rest of the hardware is shared. It is surprisingly simple (but still quite complicated). (Diagram: per-thread fetch units T1-T4 feeding shared decode/rename/schedule/EX/Mem/writeback pipelines.)
91 SMT Advantages: Exploit the ILP of multiple threads at once. Less dependence on branch prediction (fewer correct predictions required per thread). Less idle hardware (increased power efficiency). Much higher IPC -- up to 4 (in simulation). Disadvantages: threads can fight over resources and slow each other down. Historical footnote: invented, in part, by our own Dean Tullsen when he was at UW.
92 Keeping the Window Filled Keeping the instruction window filled is key! Instruction windows are about 32 instructions (size is limited by their complexity, which is considerable). Branches occur every 4-5 instructions. This means the processor must predict 6-8 consecutive branches correctly to keep the window full. On a mispredict, you flush the pipeline, which includes emptying the window.
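The "6-8 consecutive correct predictions" point can be quantified with a back-of-the-envelope calculation: if each branch is predicted correctly with probability p, the window stays full only when all ~7 branches inside it are predicted correctly, i.e. with probability p^7. The accuracy values below are illustrative, not measured numbers from any particular machine:

```python
# Probability that a ~32-entry window (holding ~7 branches, one every 4-5
# instructions) is entirely on the correct path, for several per-branch
# prediction accuracies.

def window_full_probability(accuracy, branches_in_window=7):
    return accuracy ** branches_in_window

for acc in (0.80, 0.95, 0.99):
    print(f"{acc:.2f} -> {window_full_probability(acc):.2f}")
# 0.80 -> 0.21
# 0.95 -> 0.70
# 0.99 -> 0.93
```

This is why the jump from ~80% (21064-era) to ~95%+ (tournament-predictor) accuracy matters so much: it is the difference between a mostly-empty and a mostly-full window.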
More informationAdvanced Computer Architecture
Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationCISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1
CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationInstruction-Level Parallelism and Its Exploitation
Chapter 2 Instruction-Level Parallelism and Its Exploitation 1 Overview Instruction level parallelism Dynamic Scheduling Techniques es Scoreboarding Tomasulo s s Algorithm Reducing Branch Cost with Dynamic
More informationChapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)
Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3) Hardware Speculation and Precise
More informationExploitation of instruction level parallelism
Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering
More informationCPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation
Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Instruction
More informationComputer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović
Computer Architecture and Engineering CS52 Quiz #3 March 22nd, 202 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 0 Pages Notes: Not all questions are
More informationCISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP
CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationChapter 4 The Processor 1. Chapter 4D. The Processor
Chapter 4 The Processor 1 Chapter 4D The Processor Chapter 4 The Processor 2 Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline
More informationCSE 240A Midterm Exam
Student ID Page 1 of 7 2011 Fall Professor Steven Swanson CSE 240A Midterm Exam Please write your name at the top of each page This is a close book, closed notes exam. No outside material may be used.
More informationCS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming
CS 152 Computer Architecture and Engineering Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming John Wawrzynek Electrical Engineering and Computer Sciences University of California at
More informationCPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation
Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Tomasulo
More informationEN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design
EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown
More informationCourse on Advanced Computer Architectures
Surname (Cognome) Name (Nome) POLIMI ID Number Signature (Firma) SOLUTION Politecnico di Milano, July 9, 2018 Course on Advanced Computer Architectures Prof. D. Sciuto, Prof. C. Silvano EX1 EX2 EX3 Q1
More informationPage 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring
More informationHardware-based Speculation
Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions
More informationLecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ
Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ 1 An Out-of-Order Processor Implementation Reorder Buffer (ROB)
More informationEITF20: Computer Architecture Part3.2.1: Pipeline - 3
EITF20: Computer Architecture Part3.2.1: Pipeline - 3 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Dynamic scheduling - Tomasulo Superscalar, VLIW Speculation ILP limitations What we have done
More informationChapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST
Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism
More informationE0-243: Computer Architecture
E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation
More informationComputer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士
Computer Architecture 计算机体系结构 Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2018 Review Hazards (data/name/control) RAW, WAR, WAW hazards Different types
More information5008: Computer Architecture
5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage
More informationLecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )
Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections 2.3-2.6) 1 Correlating Predictors Basic branch prediction: maintain a 2-bit saturating counter for each
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation 1 Branch Prediction Basic 2-bit predictor: For each branch: Predict taken or not
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationFour Steps of Speculative Tomasulo cycle 0
HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly
More informationHANDLING MEMORY OPS. Dynamically Scheduling Memory Ops. Loads and Stores. Loads and Stores. Loads and Stores. Memory Forwarding
HANDLING MEMORY OPS 9 Dynamically Scheduling Memory Ops Compilers must schedule memory ops conservatively Options for hardware: Hold loads until all prior stores execute (conservative) Execute loads as
More informationPage 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 15: Instruction Level Parallelism and Dynamic Execution March 11, 2002 Prof. David E. Culler Computer Science 252 Spring 2002
More informationEECS 452 Lecture 9 TLP Thread-Level Parallelism
EECS 452 Lecture 9 TLP Thread-Level Parallelism Instructor: Gokhan Memik EECS Dept., Northwestern University The lecture is adapted from slides by Iris Bahar (Brown), James Hoe (CMU), and John Shen (CMU
More informationItanium 2 Processor Microarchitecture Overview
Itanium 2 Processor Microarchitecture Overview Don Soltis, Mark Gibson Cameron McNairy, August 2002 Block Diagram F 16KB L1 I-cache Instr 2 Instr 1 Instr 0 M/A M/A M/A M/A I/A Template I/A B B 2 FMACs
More informationMulti-cycle Instructions in the Pipeline (Floating Point)
Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining
More informationReorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)
Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers) physical register file that is the same size as the architectural registers
More informationLecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S
Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest
More informationModule 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.
Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R10000 A case study in modern microarchitecture Overview Stage 1: Fetch Stage 2: Decode/Rename Branch prediction Branch
More informationMultithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others
Multithreading Processors and Static Optimization Review Adapted from Bhuyan, Patterson, Eggers, probably others Schedule of things to do By Wednesday the 9 th at 9pm Please send a milestone report (as
More informationLecture 18: Instruction Level Parallelism -- Dynamic Superscalar, Advanced Techniques,
Lecture 18: Instruction Level Parallelism -- Dynamic Superscalar, Advanced Techniques, ARM Cortex-A53, and Intel Core i7 CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong
More informationInstruction Level Parallelism. Appendix C and Chapter 3, HP5e
Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls
More informationA Key Theme of CIS 371: Parallelism. CIS 371 Computer Organization and Design. Readings. This Unit: (In-Order) Superscalar Pipelines
A Key Theme of CIS 371: arallelism CIS 371 Computer Organization and Design Unit 10: Superscalar ipelines reviously: pipeline-level parallelism Work on execute of one instruction in parallel with decode
More informationDynamic Issue & HW Speculation. Raising the IPC Ceiling
Dynamic Issue & HW Speculation Today s topics: Superscalar pipelines Dynamic Issue Scoreboarding: control centric approach Tomasulo: data centric approach 1 CS6810 Raising the IPC Ceiling w/ single-issue
More informationPage 1. Raising the IPC Ceiling. Dynamic Issue & HW Speculation. Fix OOO Completion Problem First. Reorder Buffer In Action
Raising the IPC Ceiling Dynamic Issue & HW Speculation Today s topics: Superscalar pipelines Dynamic Issue Scoreboarding: control centric approach Tomasulo: data centric approach w/ single-issue IPC max
More informationLecture 9: Dynamic ILP. Topics: out-of-order processors (Sections )
Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections 2.3-2.6) 1 An Out-of-Order Processor Implementation Reorder Buffer (ROB) Branch prediction and instr fetch R1 R1+R2 R2 R1+R3 BEQZ R2 R3
More informationILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)
Instruction-Level Parallelism and its Exploitation: PART 1 ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Project and Case
More informationComputer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)
18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures
More informationSuper Scalar. Kalyan Basu March 21,
Super Scalar Kalyan Basu basu@cse.uta.edu March 21, 2007 1 Super scalar Pipelines A pipeline that can complete more than 1 instruction per cycle is called a super scalar pipeline. We know how to build
More informationTDT 4260 lecture 7 spring semester 2015
1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding
More information15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011
5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More information250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019
250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr
More informationStatic & Dynamic Instruction Scheduling
CS3014: Concurrent Systems Static & Dynamic Instruction Scheduling Slides originally developed by Drew Hilton, Amir Roth, Milo Martin and Joe Devietti at University of Pennsylvania 1 Instruction Scheduling
More informationInstruction Level Parallelism
Instruction Level Parallelism The potential overlap among instruction execution is called Instruction Level Parallelism (ILP) since instructions can be executed in parallel. There are mainly two approaches
More informationPerformance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.
Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution
More informationCISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP
CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationCS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars
CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory
More informationCISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions
CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis6627 Powerpoint Lecture Notes from John Hennessy
More informationLECTURE 3: THE PROCESSOR
LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU
More informationWilliam Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors
William Stallings Computer Organization and Architecture 8 th Edition Chapter 14 Instruction Level Parallelism and Superscalar Processors What is Superscalar? Common instructions (arithmetic, load/store,
More information6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU
1-6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU Product Overview Introduction 1. ARCHITECTURE OVERVIEW The Cyrix 6x86 CPU is a leader in the sixth generation of high
More information