John P. Shen Microprocessor Research Intel Labs March 19, 2002
1. Superscalar Processor Design — John P. Shen, Microprocessor Research, Intel Labs, March 19, 2002
2. Moore's Law Continues… (charts: 486, Pentium, Pentium Pro/P6; Source: Intel Corporation)
- Transistors (MT): transistor count doubles every two years (2X growth in 1.96 years)
- Die size (mm): grows ~14% in two years (~7% growth per year, ~2X growth in 10 years)
- Frequency (MHz): doubles every two years
- Power (Watts): grows exponentially
3. …For At Least Another Decade (projected charts; Source: Intel Corporation)
- Transistors: from 425M toward ~2B (486, Pentium, Pentium Pro, Pentium III)
- Die size: ~40mm die; ~7% growth per year, ~2X growth in 10 years
- Frequency: 3 GHz → 6.5 GHz → 14 GHz → ~30 GHz
- Power: too high
4. Microprocessor Performance (chart)
5. Evolution of Microprocessors
- Transistor count: 2K-100K | 100K-1M | 1M-100M | 100M-1B
- Clock frequency: 0.1-3 MHz | 3-30 MHz | 30 MHz-1 GHz | 1-15 GHz
- Instructions/cycle: < (?)
6. Performance Growth in Perspective
Doubling every 18 months: total of 3,200X
- Cars travel at 176,000 MPH; get 64,000 miles/gal.
- Air travel: L.A. to N.Y. in 5.5 seconds (MACH 3200)
- Wheat yield: 320,000 bushels per acre
Doubling every 24 months: total of 36,000X
- Cars travel at 2,400,000 MPH; get 600,000 miles/gal.
- Air travel: L.A. to N.Y. in 0.5 seconds (MACH 36,000)
- Wheat yield: 3,600,000 bushels per acre
Unmatched by any other industry!! [John Crawford, Intel, 1993]
7. "Iron Law" of Processor Performance
1/Performance = Wall-Clock Time = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)
i.e. execution time = instruction count × CPI × cycle time
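As a worked example of the Iron Law, here is a small Python sketch; the function name and the sample numbers are illustrative, not from the slides:

```python
def exec_time_seconds(instr_count, cpi, cycle_time_ns):
    """Wall-clock time = instructions x CPI x cycle time (Iron Law)."""
    return instr_count * cpi * cycle_time_ns * 1e-9

# Illustrative numbers: 10^9 instructions, CPI of 1.5, 1 ns cycle (1 GHz).
t = exec_time_seconds(1e9, 1.5, 1.0)
perf = 1.0 / t   # performance is the reciprocal of execution time
```

Halving any one factor (fewer instructions, lower CPI, or a shorter cycle) doubles performance, which is why the three terms organize the rest of the lecture.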
8. Review of Scalar Pipelined Processors
9. Scalar Pipelined Processors
The typical 6-stage pipeline (1 IF, 2 ID, 3 RD, 4 ALU, 5 MEM, 6 WB) and what each instruction type does per stage:
- ALU: I-cache/PC → decode → read registers → ALU op → — → write register
- LOAD: I-cache/PC → decode → read registers → address gen. → read memory → write register
- STORE: I-cache/PC → decode → read registers → address gen. → write memory
- BRANCH: I-cache/PC → decode → read registers → address gen. → write PC
10. Interinstruction Dependences
- Data dependence (Read-after-Write, RAW): r3 ← r1 op r2 ; r5 ← r3 op r4
- Anti-dependence (Write-after-Read, WAR): r3 ← r1 op r2 ; r1 ← r4 op r5
- Output dependence (Write-after-Write, WAW): r3 ← r1 op r2 ; r5 ← r3 op r4 ; r3 ← r6 op r7
- Control dependence
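The three register-name relationships above can be checked mechanically. A minimal Python sketch; the tuple encoding of an instruction as (destination, (sources…)) is an assumption for illustration:

```python
def classify_dependences(earlier, later):
    """Dependences of `later` on `earlier`; each instruction is
    encoded as (dest_reg, (src_regs, ...))."""
    d1, srcs1 = earlier
    d2, srcs2 = later
    deps = set()
    if d1 in srcs2:
        deps.add("RAW")   # true data dependence (read-after-write)
    if d2 in srcs1:
        deps.add("WAR")   # anti-dependence (write-after-read)
    if d1 == d2:
        deps.add("WAW")   # output dependence (write-after-write)
    return deps

# The slide's RAW pair: r3 <- r1 op r2 ; r5 <- r3 op r4
assert classify_dependences(("r3", ("r1", "r2")), ("r5", ("r3", "r4"))) == {"RAW"}
```

Only RAW carries actual data; WAR and WAW are name conflicts, which is what register renaming (later slides) exploits.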
11. ALU Interlock and Penalty (pipeline diagram with ALU forwarding paths a, b, c)
For a RAW dependence on an ALU result (i: Rx ← …):
- dist = 1 (i → i+1): forward via path a
- dist = 2 (i → i+2): forward via path b
- dist = 3 (i → i+3): no hazard; i writes R1 before i+3 reads R1
12. Load Interlock and Penalty (pipeline diagram with load forwarding path d)
For a RAW dependence on a load result (i: Rx ← mem[…]):
- dist = 1 (i → i+1): stall i+1 one cycle, then forward via path d
- dist = 2 (i → i+2): no hazard; i writes R1 before i+2 reads R1
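The interlock rules of slides 11-12 can be collapsed into one decision function. A Python sketch; the string labels and function name are illustrative:

```python
def hazard_action(producer, dist):
    """Action for a RAW dependence at instruction distance `dist`
    behind an ALU op or a load, per the 6-stage pipeline's paths."""
    if producer == "alu":
        if dist == 1:
            return "forward (path a)"
        if dist == 2:
            return "forward (path b)"
        return "none"            # dist >= 3: result is written back in time
    if producer == "load":
        if dist == 1:
            return "stall 1 cycle, then forward (path d)"
        return "none"            # dist >= 2: load writes before the read
    raise ValueError(producer)
```

The asymmetry is the point of the slide: ALU results can always be forwarded without stalling, but a load's one-cycle memory latency forces a bubble when the consumer is the very next instruction.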
13. Major Penalty Loops of Pipelining (diagram)
1. Branch penalty  2. ALU penalty  3. Load penalty
Performance objective: reduce CPI to 1.
14. Introduction to Modern Superscalar Processors
15. Limitations of Scalar Pipelines
- Upper bound on scalar pipeline throughput: limited to IPC = 1
- Inefficient unification into a single pipeline: long latency for each instruction
- Performance lost due to a rigid pipeline: unnecessary stalls
16. Scalar to Superscalar Pipelines
- Parallel pipelines: wide pipelines that advance multiple instructions per cycle
- Diversified pipelines: a mix of multiple functional-unit types
- Dynamic pipelines: out-of-order execution with distributed functional units
17. A Modern Superscalar Processor (diagram)
- In order: Fetch → instruction/decode buffer → Decode → dispatch buffer → Dispatch
- Out of order: Issue from reservation stations → Execute → Finish
- In order: Complete (reorder/completion buffer) → Retire (store buffer)
18. Flow Paths of Superscalars (diagram)
- Instruction flow: I-cache and branch predictor → FETCH → instruction buffer → DECODE
- Register data flow: integer, floating-point, media, and memory units with the Reorder Buffer (ROB) → EXECUTE
- Memory data flow: store queue and D-cache → COMMIT
19. Instruction Flow Techniques (Branch Penalty)
20. What's So Bad About Branches? (superscalar pipeline diagram, highlighting the branch reservation station: a branch is not resolved until it executes)
21. Riseman and Foster's Study
- 7 benchmark programs on the CDC-3600
- Assume an infinite machine: infinite memory and instruction stack, register file, functional units
- Consider only true dependences, at the data-flow limit
- If execution is bounded to a single basic block (i.e. no bypassing of branches), the maximum speedup is 1.72
- Suppose one can bypass conditional branches and jumps (i.e. assume the actual branch path is always known, so branches do not impede instruction execution):
  branches bypassed vs. maximum speedup:
22. Branch Prediction
- Target address generation → target speculation: access a register (PC, GP register, link register); perform a calculation (+/- offset, auto increment/decrement)
- Condition resolution → condition speculation: access a register (condition code register, data register, count register); perform a calculation (comparison of data registers)
23. Target Address Generation (pipeline diagram: the target can be formed by register-indirect with offset, register-indirect, or PC-relative addressing, each resolved at a different pipeline stage)
24. Condition Resolution (pipeline diagram: the condition is known from a CC register or a GP register value comparison, each resolved at a different pipeline stage)
25. Branch Instruction Speculation (diagram)
- The branch predictor (using a BTB) supplies a speculative condition prediction and a speculative target
- An FA-mux selects the next PC sent to the I-cache: the predicted target or npc(seq.) = PC+4
- The BTB is updated (target address and history) when the branch resolves in the branch reservation station
26. Example Prediction Algorithm
- Prediction accuracy approaches its maximum with as few as 2 preceding branch occurrences used as history
- State diagram: the last two branch outcomes (TT, TN, NT, NN) determine the next prediction
- Results (%): [IBM RS/6000 Study, Nair, 1992] on the IBM1, IBM2, IBM3, IBM4, DEC, and CDC traces
27. Other Prediction Algorithms
- Saturation counter: 2-bit up/down counter with states N, n?, t?, T
- Hysteresis counter: states N, n?, t?, T with two-step (hysteretic) switching between strong states
- Combining prediction accuracy with BTB hit rate (86.5% for 128 sets of 4 entries each), branch prediction can provide a net prediction accuracy of approximately 80%. This implies a 5-20% performance enhancement.
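A 2-bit saturation counter is easy to simulate. A minimal Python sketch; the starting state (weakly taken) and the sample trace are assumptions for illustration:

```python
def predict_saturating(outcomes, counter=2):
    """Accuracy of a 2-bit saturation counter over a branch trace.
    Counter values 0-1 predict not-taken, 2-3 predict taken."""
    correct = 0
    for taken in outcomes:
        correct += ((counter >= 2) == taken)   # predict, then compare
        # saturate at 0 and 3 rather than wrapping around
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)

# A loop branch taken 7 times then not taken once: one miss per loop exit.
acc = predict_saturating([True] * 7 + [False])   # 7/8 correct
```

The saturation is what gives the counter hysteresis: a single anomalous outcome (the loop exit) flips the prediction for at most one later occurrence.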
28. Optimal Predictor via Exhaustive Search
- There are 2^20 possible state machines of 2-bit predictors
- Pruning uninteresting and redundant machines leaves 5,248
- It is possible to exhaustively search and find the optimal predictor for a benchmark
- Best-predictor accuracy (%): spice2g6 —, doduc 94.3*, gcc 89.1*, espresso 89.1*, li 87.1*, eqntott 87.9*
- The saturation counter (*) is near optimal in all cases!
29. Number of Counter Bits Needed
Prediction accuracy % (overall CPI overhead), by counter width:
- spice2g6: — (0.009) | 97.0 (0.009) | 96.2 (0.013) | 76.6 (0.031)
- doduc:    94.2 (0.003) | 94.3 (0.003) | 90.2 (0.004) | 69.2 (0.022)
- gcc:      89.7 (0.025) | 89.1 (0.026) | 86.0 (0.033) | 50.0 (0.128)
- espresso: 89.5 (0.045) | 89.1 (0.047) | 87.2 (0.054) | 58.5 (0.176)
- li:       88.3 (0.042) | 86.8 (0.048) | 82.5 (0.063) | 62.4 (0.142)
- eqntott:  89.3 (0.028) | 87.2 (0.033) | 82.9 (0.046) | 78.4 (0.049)
(columns: 3-bit | 2-bit | 1-bit | 0-bit counters)
- Branch history table size: a direct-mapped array of 2^k entries
- Programs like gcc can have over 7,000 conditional branches; on collisions, multiple branches share the same predictor
- The variation of branch penalty with branch history table size levels out at …
30. BHT and BTAC (PowerPC diagram)
- The PC indexes the Branch History Table and the Branch Target Address Cache alongside the I-cache
- Predictions feed decode and dispatch; the branch unit (BRN) provides feedback to update the BHT/BTAC
- Execution units: SFX, SFX, CFX, FPU, LS, issued from reservation stations; results finish into the completion buffer
31. PPC Fetch Address Generation (diagram)
- The BHT and BTAC are accessed in the fetch stage (FAR); prediction logic (4 instructions wide) at fetch, decode, and dispatch each selects between target and sequential addresses
- The branch execute stage and completion (exception logic + PC) provide the final resolution
32. Global Branch Prediction
- So far, the prediction of each static branch instruction has been based solely on its own past behavior, not on the behavior of neighboring static branch instructions
- A Branch History Register (shifted left on update) indexes a Pattern History Table (PHT); on branch resolution, the indexed PHT bits are run through FSM logic (old bits → new bits) to form the next prediction
33. Two-Level Adaptive Branch Prediction [Yeh & Patt]
- 1st level: history of the last k (dynamic) branches encountered
- 2nd level: branch behavior of the last s occurrences of the specific pattern of these k branches
- Use a Branch History Register (BHR) in conjunction with a Pattern History Table (PHT)
- Example (k=8, s=6): for the last k branches with a given behavior pattern, the s-bit history at that PHT entry is [101010]; using this history, the branch prediction algorithm predicts the direction of the branch
- Effectiveness: average 97% accuracy for SPEC; used in the Intel P6 and AMD K6
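A global-history two-level predictor (GAg style, with 2-bit counters in the PHT) can be sketched in a few lines of Python. The class name, the weakly-taken initial counters, and the alternating test pattern are assumptions for illustration:

```python
class TwoLevelPredictor:
    """GAg-style two-level predictor sketch: a global k-bit Branch
    History Register indexes a PHT of 2-bit saturation counters."""
    def __init__(self, k):
        self.k = k
        self.bhr = 0
        self.pht = [2] * (1 << k)      # all counters start weakly taken

    def predict(self):
        return self.pht[self.bhr] >= 2

    def update(self, taken):
        ctr = self.pht[self.bhr]
        self.pht[self.bhr] = min(ctr + 1, 3) if taken else max(ctr - 1, 0)
        # shift the resolved outcome into the k-bit history register
        self.bhr = ((self.bhr << 1) | int(taken)) & ((1 << self.k) - 1)
```

A strictly alternating branch (T, N, T, N, …) defeats a lone saturation counter but is predicted perfectly here after warm-up, because each history pattern gets its own counter.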
34. Nomenclature: {G,P}A{g,p,s}
- The PC and a Branch History Shift Register (BHSR, shifted left on update) index the Pattern History Table (PHT); on resolution, the PHT bits are updated via FSM logic to form the prediction
- To achieve 97% average prediction accuracy:
  GAg: 1 BHSR of 18 bits; 1 PHT of 2^18 × 2 bits; total = 524 kbits
  PAg: 512×4 BHSRs of 12 bits; 1 PHT of 2^12 × 2 bits; total = 33 kbits
  PAs: 512×4 BHSRs of 6 bits; 512 PHTs of 2^6 × 2 bits; total = 78 kbits
35. Global BHSR Scheme (GAs) (diagram)
- j bits of the branch address concatenated with k bits of a single global BHSR index a BHT of 2^(j+k) × 2 bits
36. Per-Branch BHSR Scheme (PAs) (diagram)
- i bits of the branch address select one of 2^i per-branch BHSRs (k × 2^i bits of history, as in a standard BHT); j address bits plus the k BHSR bits index a BHT of 2^(j+k) × 2 bits
37. Gshare Scheme (diagram)
- j bits of the branch address XORed with k bits of the global BHSR index a BHT of 2^max(j,k) × 2 bits
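The XOR indexing this slide shows can be sketched directly. In this Python sketch, the `>> 2` word-alignment shift is an assumption (for 4-byte instructions); the slide itself only specifies the XOR and the masking:

```python
def gshare_index(pc, bhsr, k):
    """Index = (branch address XOR global history), masked to k bits,
    so a single table of 2^max(j,k) entries is shared."""
    return ((pc >> 2) ^ bhsr) & ((1 << k) - 1)

# The same branch with different global histories hashes to
# different table entries, reducing pattern conflicts.
a = gshare_index(52, 0b0000, 4)
b = gshare_index(52, 0b0011, 4)
```

Folding address and history into one index is what lets the XOR scheme match the storage of a small table while still using both kinds of information.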
38. Register Data Flow Techniques (ALU Penalty)
39. Interinstruction Dependences (review)
- Data dependence (Read-after-Write, RAW): r3 ← r1 op r2 ; r5 ← r3 op r4
- Anti-dependence (Write-after-Read, WAR): r3 ← r1 op r2 ; r1 ← r4 op r5
- Output dependence (Write-after-Write, WAW): r3 ← r1 op r2 ; r5 ← r3 op r4 ; r3 ← r6 op r7
- Control dependence
40. Resolving False Dependences
- WAR: (1) R4 ← R3 + 1 ; (2) R3 ← R5 + 1 — must prevent (2) from completing before (1) is dispatched
- WAW: (1) R3 ← R3 op R5 ; (2) R3 ← R5 + 1 — must prevent (2) from completing before (1) completes
- Stalling: delay dispatch (or write-back) of the 2nd instruction
- Copy operands: copy a not-yet-used operand to prevent it from being overwritten (WAR)
- Register renaming: use a different register (WAW & WAR)
41. Register Renaming
- Anti- and output dependences are false dependences: the dependence is on the name/location rather than on the data
  r3 ← r1 op r2 ; r5 ← r3 op r4 ; r3 ← r6 op r7
- Given an infinite number of registers, anti- and output dependences can always be eliminated
  Original:        Renamed:
  r1 ← r2 / r3     r1 ← r2 / r3
  r4 ← r1 * r5     r4 ← r1 * r5
  r1 ← r3 + r6     r8 ← r3 + r6
  r3 ← r1 - r4     r9 ← r8 - r4
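The renaming in the table above can be reproduced by a simple algorithm: any write that would collide with an earlier read or write of the same register gets a fresh name. This Python sketch uses that rename-on-conflict rule for illustration; real hardware simply renames every destination from a free list:

```python
def rename(instrs, first_new=8):
    """Rename destinations to remove WAR/WAW dependences.
    Each instruction is (dest, (src, src)); fresh names are r8, r9, ..."""
    mapping = {}            # architectural name -> current name
    seen = set()            # registers already read or written
    out, nxt = [], first_new
    for dest, srcs in instrs:
        new_srcs = tuple(mapping.get(s, s) for s in srcs)  # preserves RAW
        seen.update(srcs)
        new_dest = dest
        if dest in seen:                                   # WAR or WAW
            new_dest, nxt = "r%d" % nxt, nxt + 1
        mapping[dest] = new_dest
        seen.add(dest)
        out.append((new_dest, new_srcs))
    return out

# The slide's example:
code = [("r1", ("r2", "r3")), ("r4", ("r1", "r5")),
        ("r1", ("r3", "r6")), ("r3", ("r1", "r4"))]
renamed = rename(code)
```

Note how the last instruction's source r1 is rewritten to r8, so the RAW chain is kept intact while the WAR/WAW conflicts disappear.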
42. Register Renaming Mechanisms (diagram)
- A map table translates a register specifier to either the Architectural Register File (ARF: data) or the Rename Register File (RRF: data, valid, busy, tag) for the operand read
- RRF pointers track the next entry to be allocated and the next entry to complete
43. Elements of Modern Micro-dataflow (diagram)
- In order: allocate Reorder Buffer entries at dispatch from the dispatch buffer; read the register file and rename registers
- Out of order: reservation stations issue to branch, integer, integer, floating-point, and load/store units; results are forwarded to the reservation stations and rename registers; register write-back
- In order: the Completion Buffer (Reorder Buffer) is managed as a queue and maintains the sequential order of all instructions in flight ("takeoff" = dispatching; "landing" = completion)
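The "managed as a queue" behavior of the reorder buffer can be sketched concretely. Class and method names here are assumptions for illustration:

```python
from collections import deque

class ReorderBuffer:
    """ROB sketch: entries allocated in program order at dispatch
    ('takeoff'), marked finished out of order as execution ends,
    and drained from the head strictly in order ('landing')."""
    def __init__(self, size):
        self.size = size
        self.entries = deque()            # each entry: [tag, finished?]

    def dispatch(self, tag):
        if len(self.entries) == self.size:
            return False                  # ROB full: stall dispatch
        self.entries.append([tag, False])
        return True

    def finish(self, tag):                # out-of-order completion signal
        for entry in self.entries:
            if entry[0] == tag:
                entry[1] = True

    def retire(self):                     # in-order retirement from the head
        retired = []
        while self.entries and self.entries[0][1]:
            retired.append(self.entries.popleft()[0])
        return retired
```

Even if a younger instruction finishes first, nothing retires until the head entry is finished, which is what preserves the sequential (architectural) state.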
44. Memory Data Flow Techniques (Load Penalty)
45. Total Ordering of Loads and Stores
- Keep all loads and stores totally in order with respect to each other
- However, loads and stores can execute out of order with respect to other types of instructions (while obeying register data dependences)
- Exception: stores must still be held until all previous instructions complete
46. The "DAXPY" Example: Y[i] = A * X[i] + Y[i]
      LD    F0, a
      ADDI  R4, Rx, #512   ; last address
Loop: LD    F2, 0(Rx)      ; load X[i]
      MULTD F2, F0, F2     ; A * X[i]
      LD    F4, 0(Ry)      ; load Y[i]
      ADDD  F4, F2, F4     ; A * X[i] + Y[i]
      SD    F4, 0(Ry)      ; store into Y[i]
      ADDI  Rx, Rx, #8     ; inc. index to X
      ADDI  Ry, Ry, #8     ; inc. index to Y
      SUB   R20, R4, Rx    ; compute bound
      BNZ   R20, Loop      ; check if done
(dependence graph: LD → MULTD → ADDD → SD, with the second LD feeding ADDD independently)
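For reference, the same loop in plain Python; nothing here is from the slide beyond the Y[i] = A * X[i] + Y[i] recurrence itself:

```python
def daxpy(a, x, y):
    """Y[i] = A * X[i] + Y[i], the recurrence the assembly loop computes."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]   # multiply-add per element, in place
    return y
```

Each iteration's multiply-add chain is independent of every other iteration's, which is why this kernel is a standard vehicle for studying load bypassing and out-of-order memory data flow.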
47. Processing of Load/Store Instructions (diagram)
- In order: dispatch from the dispatch buffer; architectural RF and renamed RF; register write-back
- Out of order: branch, integer, integer, floating-point, and load/store units; the load/store pipeline is address generation → address translation → memory access (data memory)
- In order: Reorder Buffer → complete; store buffer → retire
48. Load/Store Units and the Store Buffer (diagram)
- The store unit and the load unit each perform address generation and address translation; the load unit additionally accesses memory
- Speculative state: the (finished) store buffer holds address + data; committed in-order state: the (completed) store buffer updates the data cache
49. Load Bypassing
- Loads can be allowed to bypass older stores if no aliasing is found: the older stores' addresses must be computed before loads can be issued, to allow checking for RAW
- Alternatively, a load can assume no aliasing and bypass older stores speculatively; validation against previous stores must then be done, with a mechanism for reversing the effect
- Stores are kept in the ROB (or the finished store buffer) until all older instructions complete
- At completion time, a store is moved to the completed store buffer to wait its turn to access the cache; the store is then considered completed, and latency beyond this point has little effect on processor throughput
50. Illustration of Load Bypassing (diagram)
- The load address is tag-matched against the (finished and completed) store buffer entries (addr, data)
- If there is no match, the load bypasses the stores, reads the data cache, and updates its destination register
51. Load Forwarding
- If a pending load is RAW-dependent on an earlier store still in the store buffer, it need not wait until the store is issued to the data cache
- The load can be satisfied directly from the store buffer if both the load and store addresses are valid and the data is available in the store buffer
- This avoids the latency of accessing the data cache
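Load forwarding and load bypassing together amount to a search of the store buffer. A Python sketch; the list/dict data structures are illustrative, since real hardware performs an associative tag match:

```python
def resolve_load(load_addr, store_buffer, cache):
    """Resolve a load against older buffered stores (sketch).
    store_buffer: (addr, data) pairs of older stores, oldest first.
    cache: dict mapping address -> data."""
    for addr, data in reversed(store_buffer):  # youngest matching store wins
        if addr == load_addr:
            return data        # load forwarding: satisfied from the buffer
    return cache[load_addr]    # load bypassing: no alias, read the D-cache
```

Scanning youngest-first matters: if two buffered stores hit the same address, the load must see the most recent value, not the oldest.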
52. Illustration of Load Forwarding (diagram)
- The load address is tag-matched against the store buffer entries; on a match, the buffered store data is forwarded directly to the load's destination register
53. Speculative Load Bypassing (diagram)
- When a load is issued, its address is tag-matched against the store buffers: a match means load forwarding; no match means speculative load bypassing, and the load is recorded in the finished load buffer while its renamed register is updated
- At store completion, the store address is tag-matched against the finished load buffer: a match means an aliased load, so the aliased load and all younger instructions are flushed; no match allows the in-order (architectural) state to be updated
54. Dual-Ported Non-Blocking Cache (diagram)
- The store unit and two load units feed a dual-ported data cache; cache misses go to a missed-load queue and on to main memory, so subsequent accesses are not blocked
55. Prefetching Data Cache (diagram)
- A memory reference predictor drives a prefetch queue that brings data from main memory into the data cache ahead of demand loads issued by the load unit
56. Current Challenges in Superscalar Design
57. "Iron Law" of Processor Performance (revisited)
1/Performance = Wall-Clock Time = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)
Processor performance = (IPC × GHz) / instruction count
58. Frequency and Performance Boost (charts, i486 → Pentium → Pentium II/III → Pentium 4 across the 1.0µ, 0.7µ, 0.5µ, 0.35µ, 0.25µ, 0.18µ processes; Source: Intel Corporation)
- Frequency (Freq = process × µarch) increased 50X: 13X due to process technology, an additional 4X due to microarchitecture
- Relative performance increased >75X: 13X due to process technology, an additional >6X due to microarchitecture
*Note: performance measured using SPECint and SPECfp
59. Frequency vs. Parallelism
- Increase frequency (GHz): deeper pipelines → increased overall latency, lower IPC
- Increase instruction parallelism (IPC): wider pipelines → increased complexity, lower GHz
60. Deeper and Wider Pipelines (diagram)
- Fetch → Decode → Dispatch → Execute → Memory → Retire, with each stage subdivided (deeper) and replicated (wider)
- The branch mispredict penalty spans from fetch to branch resolution in execute, so it grows with pipeline depth
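The cost of a deeper front end can be illustrated with a toy CPI model. The parameterization below is made up for illustration and is not from the slides:

```python
def cpi_with_mispredicts(base_cpi, mispredicts_per_instr, flush_penalty):
    """Toy model: each mispredicted branch flushes the front end,
    so the per-mispredict penalty tracks front-end pipeline depth."""
    return base_cpi + mispredicts_per_instr * flush_penalty

# Doubling the front-end depth raises CPI even with base CPI unchanged.
shallow = cpi_with_mispredicts(1.0, 0.01, 10)   # 10-stage front end
deep = cpi_with_mispredicts(1.0, 0.01, 20)      # 20-stage front end
```

This is why higher frequency via deeper pipelining trades away IPC: the clock term of the Iron Law improves while the CPI term degrades.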
61. Front-End Pipe Depth Penalty (diagram)
- Front-end contraction: shorten fetch/decode/dispatch
- Back-end optimization: optimize execute/memory/retire
62. Alleviating the Pipe Depth Penalty
Front-end contraction:
- Code re-mapping and caching
- Trace construction, caching, and optimization
- Leverage back-end optimizations
Back-end optimization:
- Multiple-branch, trace, and stream prediction
- Code reordering, alignment, and optimization
- Pre-decode, pre-rename, pre-scheduling
- Memory prefetch prediction and control
63. Execution Core Improvement (diagram)
- Super-pipelined ALU design; very high-speed arithmetic units
- Speculative out-of-order execution
- Criticality-based data caching
- Aggressive data prefetching
64. How Deep Can You Go? (chart: frequency, CPI, performance, and power as a function of pipeline depth) [Ed Grochowski, 7/6/01] Source: Intel Corporation
65. How Much ILP Is There? (measured/estimated speedups from published studies)
- Weiss and Smith [1984]: 1.58
- Sohi and Vajapeyam [1987]: 1.81
- Tjaden and Flynn [1970]: 1.86
- Tjaden and Flynn [1973]: 1.96
- Uht [1986]: 2.00
- Smith et al. [1989]: 2.00
- Jouppi and Wall [1988]: 2.40
- Johnson [1991]: 2.50
- Acosta et al. [1986]: 2.79
- Wedig [1982]: 3.00
- Butler et al. [1991]: 5.8
- Melvin and Patt [1991]: 6
- Wall [1991]: 7
- Kuck et al. [1972]: 8
- Riseman and Foster [1972]: 51
- Nicolau and Fisher [1984]: 90
66. Landscape of Microprocessor Families (SPECint chart)
67. Landscape of Microprocessor Families (SPECint95)
(chart: SPECint95/MHz vs. frequency in MHz for the Intel x86 (Pentium, Pentium Pro, PII, PIII), AMD x86 (Athlon), and Alpha families)
68. Landscape of Microprocessor Families (SPECint2000)
(chart: SPECint2000/MHz vs. frequency in MHz for the Intel x86 (PIII-Xeon, Pentium 4), AMD x86 (Athlon), Alpha (21264A/B/C), PowerPC, Sparc (Sparc-III), and IPF (Itanium 800) families)
69. Landscape of Microprocessor Families (SPECint chart)
70. The Pentium® 4 Processor
More informationEN164: Design of Computing Systems Lecture 24: Processor / ILP 5
EN164: Design of Computing Systems Lecture 24: Processor / ILP 5 Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationInstruction Level Parallelism. Taken from
Instruction Level Parallelism Taken from http://www.cs.utsa.edu/~dj/cs3853/lecture5.ppt Outline ILP Compiler techniques to increase ILP Loop Unrolling Static Branch Prediction Dynamic Branch Prediction
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationComplex Pipelines and Branch Prediction
Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle
More informationLecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )
Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target
More informationAdvanced Computer Architecture
Advanced Computer Architecture 1 L E C T U R E 4: D A T A S T R E A M S I N S T R U C T I O N E X E C U T I O N I N S T R U C T I O N C O M P L E T I O N & R E T I R E M E N T D A T A F L O W & R E G I
More informationEECC551 Exam Review 4 questions out of 6 questions
EECC551 Exam Review 4 questions out of 6 questions (Must answer first 2 questions and 2 from remaining 4) Instruction Dependencies and graphs In-order Floating Point/Multicycle Pipelining (quiz 2) Improving
More information5008: Computer Architecture
5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage
More informationEECS 470 Lecture 4. Pipelining & Hazards II. Fall 2018 Jon Beaumont
GAS STATION Pipelining & Hazards II Fall 208 Jon Beaumont http://www.eecs.umich.edu/courses/eecs470 Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, artin, Roth, Shen, Smith,
More informationMulti-cycle Instructions in the Pipeline (Floating Point)
Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining
More informationLecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2
Lecture 5: Instruction Pipelining Basic concepts Pipeline hazards Branch handling and prediction Zebo Peng, IDA, LiTH Sequential execution of an N-stage task: 3 N Task 3 N Task Production time: N time
More informationCPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor
1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic
More informationTopics. Digital Systems Architecture EECE EECE Predication, Prediction, and Speculation
Digital Systems Architecture EECE 343-01 EECE 292-02 Predication, Prediction, and Speculation Dr. William H. Robinson February 25, 2004 http://eecs.vanderbilt.edu/courses/eece343/ Topics Aha, now I see,
More informationComplex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar
Complex Pipelining COE 501 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Diversified Pipeline Detecting
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2017 Multiple Issue: Superscalar and VLIW CS425 - Vassilis Papaefstathiou 1 Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order
More informationCS 152 Computer Architecture and Engineering
CS 152 Computer Architecture and Engineering Lecture 18 Advanced Processors II 2006-10-31 John Lazzaro (www.cs.berkeley.edu/~lazzaro) Thanks to Krste Asanovic... TAs: Udam Saini and Jue Sun www-inst.eecs.berkeley.edu/~cs152/
More informationChapter 4 The Processor 1. Chapter 4D. The Processor
Chapter 4 The Processor 1 Chapter 4D The Processor Chapter 4 The Processor 2 Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline
More informationCS 152, Spring 2011 Section 8
CS 152, Spring 2011 Section 8 Christopher Celio University of California, Berkeley Agenda Grades Upcoming Quiz 3 What it covers OOO processors VLIW Branch Prediction Intel Core 2 Duo (Penryn) Vs. NVidia
More informationPage 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring
More informationELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism
ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University,
More informationESE 545 Computer Architecture Instruction-Level Parallelism (ILP): Speculation, Reorder Buffer, Exceptions, Superscalar Processors, VLIW
Computer Architecture ESE 545 Computer Architecture Instruction-Level Parallelism (ILP): Speculation, Reorder Buffer, Exceptions, Superscalar Processors, VLIW 1 Review from Last Lecture Leverage Implicit
More informationLecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ
Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ 1 An Out-of-Order Processor Implementation Reorder Buffer (ROB)
More informationPowerPC 620 Case Study
Chapter 6: The PowerPC 60 Modern Processor Design: Fundamentals of Superscalar Processors PowerPC 60 Case Study First-generation out-of-order processor Developed as part of Apple-IBM-Motorola alliance
More informationGetting CPI under 1: Outline
CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more
More informationExecution/Effective address
Pipelined RC 69 Pipelined RC Instruction Fetch IR mem[pc] NPC PC+4 Instruction Decode/Operands fetch A Regs[rs]; B regs[rt]; Imm sign extended immediate field Execution/Effective address Memory Ref ALUOutput
More informationLecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )
Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections 2.3-2.6) 1 Correlating Predictors Basic branch prediction: maintain a 2-bit saturating counter for each
More informationECE 571 Advanced Microprocessor-Based Design Lecture 4
ECE 571 Advanced Microprocessor-Based Design Lecture 4 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 January 2016 Homework #1 was due Announcements Homework #2 will be posted
More informationDynamic Branch Prediction
#1 lec # 6 Fall 2002 9-25-2002 Dynamic Branch Prediction Dynamic branch prediction schemes are different from static mechanisms because they use the run-time behavior of branches to make predictions. Usually
More informationInstruction Level Parallelism. Appendix C and Chapter 3, HP5e
Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation
More informationLecture 8: Compiling for ILP and Branch Prediction. Advanced pipelining and instruction level parallelism
Lecture 8: Compiling for ILP and Branch Prediction Kunle Olukotun Gates 302 kunle@ogun.stanford.edu http://www-leland.stanford.edu/class/ee282h/ 1 Advanced pipelining and instruction level parallelism
More informationLecture 21: Parallelism ILP to Multicores. Parallel Processing 101
18 447 Lecture 21: Parallelism ILP to Multicores S 10 L21 1 James C. Hoe Dept of ECE, CMU April 7, 2010 Announcements: Handouts: Lab 4 due this week Optional reading assignments below. The Microarchitecture
More informationHardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.
Instruction-Level Parallelism and its Exploitation: PART 2 Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.8)
More informationComputer Science 146. Computer Architecture
Computer Architecture Spring 2004 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 9: Limits of ILP, Case Studies Lecture Outline Speculative Execution Implementing Precise Interrupts
More information15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 8: Issues in Out-of-order Execution Prof. Onur Mutlu Carnegie Mellon University Readings General introduction and basic concepts Smith and Sohi, The Microarchitecture
More informationSuperscalar Processor
Superscalar Processor Design Superscalar Architecture Virendra Singh Indian Institute of Science Bangalore virendra@computer.orgorg Lecture 20 SE-273: Processor Design Superscalar Pipelines IF ID RD ALU
More informationCPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor
Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction
More informationReduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction
ISA Support Needed By CPU Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction So far we have dealt with control hazards in instruction pipelines by: 1 2 3 4 Assuming that the branch
More informationChapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST
Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism
More informationUG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects
Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer
More informationHY425 Lecture 05: Branch Prediction
HY425 Lecture 05: Branch Prediction Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS October 19, 2011 Dimitrios S. Nikolopoulos HY425 Lecture 05: Branch Prediction 1 / 45 Exploiting ILP in hardware
More informationEE382A Lecture 5: Branch Prediction. Department of Electrical Engineering Stanford University
EE382A Lecture 5: Branch Prediction Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 5-1 Announcements Project proposal due on Mo 10/14 List the group
More informationEEC 581 Computer Architecture. Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW)
1 EEC 581 Computer Architecture Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW) Chansu Yu Electrical and Computer Engineering Cleveland State University Overview
More informationPage 1. Today s Big Idea. Lecture 18: Branch Prediction + analysis resources => ILP
CS252 Graduate Computer Architecture Lecture 18: Branch Prediction + analysis resources => ILP April 2, 2 Prof. David E. Culler Computer Science 252 Spring 2 Today s Big Idea Reactive: past actions cause
More informationCPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation
Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Instruction
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationCOSC4201. Prof. Mokhtar Aboelaze York University
COSC4201 Chapter 3 Multi Cycle Operations Prof. Mokhtar Aboelaze York University Based on Slides by Prof. L. Bhuyan (UCR) Prof. M. Shaaban (RTI) 1 Multicycle Operations More than one function unit, each
More information