John P. Shen Microprocessor Research Intel Labs March 19, 2002


1 CS Microarchitecture: Superscalar Processor Design. John P. Shen, Microprocessor Research, Intel Labs, March 19, 2002

2 Moore's Law Continues...
[Four trend charts for Intel processors (486, Pentium, Pentium Pro/P6):]
- Transistors (MT): transistors double every two years (2X growth in 1.96 years!)
- Die size (mm): die size grows 14% in two years (~7% growth per year, ~2X growth in 10 years)
- Frequency (MHz): frequency doubles in two years
- Power (Watts): power grows exponentially
Source: Intel Corporation

3 ...For At Least Another Decade
[Projected trend charts, extrapolated from the 486, Pentium, Pentium Pro, and Pentium III:]
- Transistors (MT): heading toward ~2B transistors (425M along the way)
- Die size (mm): ~40mm die, ~7% growth per year, ~2X growth in 10 years
- Frequency (MHz): 3 GHz, 6.5 GHz, 14 GHz, toward ~30 GHz
- Power (Watts): power too high
Source: Intel Corporation

4 Microprocessor Performance

5 Evolution of Microprocessors
Transistor count:    2K-100K | 100K-1M | 1M-100M | 100M-1B
Clock frequency:     0.1-3 MHz | 3-30 MHz | 30 MHz-1 GHz | 1-15 GHz
Instructions/cycle:  < (?)

6 Performance Growth in Perspective
Doubling every 18 months ( ): total of 3,200X
- Cars travel at 176,000 MPH; get 64,000 miles/gal.
- Air travel: L.A. to N.Y. in 5.5 seconds (MACH 3200)
- Wheat yield: 320,000 bushels per acre
Doubling every 24 months ( ): total of 36,000X
- Cars travel at 2,400,000 MPH; get 600,000 miles/gal.
- Air travel: L.A. to N.Y. in 0.5 seconds (MACH 36,000)
- Wheat yield: 3,600,000 bushels per acre
Unmatched by any other industry!! [John Crawford, Intel, 1993]

7 "Iron Law" of Processor Performance
1 / Processor Performance = Wall-Clock Time / Program
  = (Instructions / Program) x (Cycles / Instruction) x (Time / Cycle)
  = instruction count x CPI x cycle time
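To make the identity concrete, here is a minimal Python sketch (not from the slides; the workload numbers are made-up illustrative values):

# Iron Law: wall-clock time = instruction count x CPI x cycle time.
def wall_clock_time(instr_count, cpi, cycle_time_s):
    return instr_count * cpi * cycle_time_s

# Illustrative values only: 1e9 instructions, CPI = 1.5, 1 GHz clock.
t = wall_clock_time(1e9, 1.5, 1e-9)
print(f"execution time = {t:.2f} s, performance = {1/t:.3f} programs/s")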

8 Review of Scalar Pipelined Processors

9 Scalar Pipelined Processors
The 6-stage TYPICAL pipeline (IF, ID, OF/RD, EX/ALU, MEM, OS/WB) and what each instruction type (ALU, LOAD, STORE, BRANCH) does per stage:
- IF (1): access I-cache with the PC (all types)
- ID (2): decode (all types)
- OF/RD (3): read registers (all types)
- EX/ALU (4): ALU op (ALU); address generation (LOAD, STORE, BRANCH)
- MEM (5): read memory (LOAD); write memory (STORE)
- OS/WB (6): write register (ALU, LOAD); write PC (BRANCH)

10 Inter-instruction Dependences
- Data dependence (Read-after-Write, RAW): r3 <- r1 op r2; r5 <- r3 op r4
- Anti-dependence (Write-after-Read, WAR): r3 <- r1 op r2; r1 <- r4 op r5
- Output dependence (Write-after-Write, WAW): r3 <- r1 op r2; r5 <- r3 op r4; r3 <- r6 op r7
- Control dependence
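A small Python sketch (an illustration, not part of the slides) that classifies the dependence between an earlier and a later instruction, each described by its destination register and set of source registers:

def classify(earlier, later):
    # earlier/later: (destination register, set of source registers)
    dst1, srcs1 = earlier
    dst2, srcs2 = later
    deps = []
    if dst1 in srcs2:
        deps.append("RAW (true data dependence)")
    if dst2 in srcs1:
        deps.append("WAR (anti-dependence)")
    if dst1 == dst2:
        deps.append("WAW (output dependence)")
    return deps or ["independent"]

# r3 <- r1 op r2 followed by r5 <- r3 op r4: RAW through r3
print(classify(("r3", {"r1", "r2"}), ("r5", {"r3", "r4"})))
# r3 <- r1 op r2 followed by r1 <- r4 op r5: WAR through r1
print(classify(("r3", {"r1", "r2"}), ("r1", {"r4", "r5"})))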

11 ALU Interlock and Penalty
[Pipeline diagram (IF, ID, RD, ALU, MEM, WB) with ALU forwarding paths a, b, c reaching dependent instructions at distance 1, 2, and 3:]
- (i to i+1): forwarding via path a
- (i to i+2): forwarding via path b
- (i to i+3): i writes R1 before i+3 reads R1

12 Load Interlock and Penalty
[Pipeline diagram (IF, ID, RD, ALU, MEM, WB) with load forwarding path(s) d, e from the memory stages to dependent instructions at distance 1, 2, and 3:]
- (i to i+1): stall i+1
- (i to i+1): forwarding via path d
- (i to i+2): i writes R1 before i+2 reads R1

13 Major Penalty Loops of Pipelining
[Pipeline diagram (IF, ID, RD, ALU, MEM, WB) with three feedback loops:] 1. branch penalty, 2. ALU penalty, 3. load penalty.
Performance objective: reduce CPI to 1.

14 Introduction to Modern Superscalar Processors

15 Limitations of Scalar Pipelines
- Upper bound on scalar pipeline throughput: limited by IPC = 1
- Inefficient unification into a single pipeline: long latency for each instruction
- Performance lost due to a rigid pipeline: unnecessary stalls

16 Scalar to Superscalar Pipelines
- Parallel pipeline: wide pipelines; advance multiple instructions per cycle
- Diversified pipeline: multiple functional unit types; mix of different functional units
- Dynamic pipeline: out-of-order execution; distributed functional units

17 A Modern Superscalar Processor
[Pipeline diagram:] Fetch -> instruction/decode buffer -> Decode -> dispatch buffer -> Dispatch -> reservation stations -> Issue/Execute -> Finish -> reorder/completion buffer -> Complete -> store buffer -> Retire.
Fetch through dispatch is in order, issue/execute/finish is out of order, and complete/retire is in order again.

18 Flow Paths of Superscalars
[Diagram: FETCH (I-cache, branch predictor, instruction buffer), DECODE, EXECUTE (integer, floating-point, media, memory units with the reorder buffer (ROB)), COMMIT (store queue, D-cache).]
Three flow paths: instruction flow, register data flow, and memory data flow.

19 Instruction Flow Techniques (Branch Penalty)

20 What's So Bad About Branches?
[Superscalar pipeline diagram (Fetch, Decode, Dispatch, reservation stations including a branch unit, Execute, Finish, Complete, Retire): branches resolve late in the pipeline, far from fetch.]

21 Riseman and Foster's Study
- 7 benchmark programs on the CDC-3600
- Assume an infinite machine: infinite memory and instruction stack, register file, functional units
- Consider only true dependences, at the data-flow limit
- If bounded to a single basic block (i.e., no bypassing of branches), the maximum speedup is 1.72
- Suppose one can bypass conditional branches and jumps (i.e., assume the actual branch path is always known, so branches do not impede instruction execution)
Br. Bypassed: Max Speedup:

22 Branch Prediction
Target address generation (target speculation)
- Access register: PC, GP register, link register
- Perform calculation: +/- offset, auto-incrementing/decrementing
Condition resolution (condition speculation)
- Access register: condition code register, data register, count register
- Perform calculation: comparison of data register(s)

23 Target Address Generation
[Pipeline diagram showing where the branch target becomes available: PC-relative targets at decode, register-indirect targets at dispatch, and register-indirect-with-offset targets at execute, each fed back to fetch.]

24 Condition Resolution
[Pipeline diagram showing where the branch condition becomes available: the condition-code register at dispatch, and GP-register value comparison at execute, each fed back to fetch.]

25 Branch Instruction Speculation
[Diagram: a branch predictor (using a BTB) sits beside the fetch stage. The FA-mux selects the next PC sent to the I-cache from either the sequential address npc = PC+4 or the speculative target; the predictor supplies a speculative condition prediction and speculative target, and the BTB is updated (target address and history) when the branch resolves at execute.]
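The following Python sketch shows the idea of target speculation with a BTB. It is a toy, fully-associative dictionary version (real BTBs are set-associative and tagged), and the addresses are made up:

class BTB:
    def __init__(self):
        self.table = {}                      # branch PC -> predicted taken-target

    def predict(self, pc):
        # Hit: speculate to the cached target; miss: next sequential address.
        return self.table.get(pc, pc + 4)

    def update(self, pc, target, taken):
        if taken:
            self.table[pc] = target          # install/refresh target of a taken branch
        else:
            self.table.pop(pc, None)         # drop entries for not-taken branches

btb = BTB()
btb.update(0x400, 0x500, taken=True)
print(hex(btb.predict(0x400)))               # 0x500 (speculative target)
print(hex(btb.predict(0x404)))               # 0x408 (sequential PC+4)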

26 Example Prediction Algorithm
Prediction accuracy approaches its maximum with as few as 2 preceding branch occurrences used as history.
[State-machine diagram: the last two branch outcomes (TT, TN, NT, NN) determine the next prediction.]
Results (%) reported for the IBM1, IBM2, IBM3, IBM4, DEC, and CDC benchmarks [IBM RS/6000 Study, Nair, 1992].

27 Other Prediction Algorithms
[State diagrams: a 2-bit saturation counter and a 2-bit hysteresis counter, each with strong/weak taken (T, t) and not-taken (N, n) states.]
Combining prediction accuracy with BTB hit rate (86.5% for 128 sets of 4 entries each), branch prediction can provide a net prediction accuracy of approximately 80%. This implies a 5-20% performance enhancement.
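A minimal Python model of the 2-bit saturation counter shown above (the branch outcome sequence is made up):

class SaturatingCounter:
    """2-bit saturating counter: states 0..3, predict taken when state >= 2."""
    def __init__(self, state=2):              # start weakly taken
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

ctr = SaturatingCounter()
outcomes = [True, True, False, True, True, True, False, True]   # made-up history
correct = 0
for taken in outcomes:
    correct += (ctr.predict() == taken)
    ctr.update(taken)
print(f"{correct}/{len(outcomes)} predictions correct")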

28 Optimal Predictor by Exhaustive Search
- There are 2^20 possible state machines of 2-bit predictors
- Pruning uninteresting and redundant machines leaves 5248
- It is possible to exhaustively search and find the optimal predictor for a benchmark
Benchmark and best-predictor accuracy (%): spice2g6, doduc 94.3, gcc 89.1, espresso 89.1, li 87.1, eqntott 87.9
The saturation counter is near optimal in all cases!

29 Number of Counter Bits Needed
Prediction accuracy (overall CPI overhead) by counter width:
Benchmark   3-bit          2-bit          1-bit          0-bit
spice2g6    (0.009)        97.0 (0.009)   96.2 (0.013)   76.6 (0.031)
doduc       94.2 (0.003)   94.3 (0.003)   90.2 (0.004)   69.2 (0.022)
gcc         89.7 (0.025)   89.1 (0.026)   86.0 (0.033)   50.0 (0.128)
espresso    89.5 (0.045)   89.1 (0.047)   87.2 (0.054)   58.5 (0.176)
li          88.3 (0.042)   86.8 (0.048)   82.5 (0.063)   62.4 (0.142)
eqntott     89.3 (0.028)   87.2 (0.033)   82.9 (0.046)   78.4 (0.049)
Branch history table size: a direct-mapped array of 2^k entries. Programs like gcc can have over 7,000 conditional branches; on collisions, multiple branches share the same predictor. The variation of branch penalty with branch history table size levels out at

30 PPC 604: BHT and BTAC
[Diagram: the fetch PC accesses the I-cache, the Branch History Table (BHT), and the Branch Target Address Cache (BTAC) in parallel; decode, dispatch, and the reservation stations (BRN, SFX, SFX, CFX, FPU, LS) follow, with feedback from branch execution to the BHT/BTAC and the completion buffer.]

31 PPC 604 Fetch Address Generation
[Diagram: the fetch address register (FAR) is selected among the sequential address, the BTAC target, and per-stage prediction logic (4 instructions wide) at the fetch, decode, and dispatch/branch-execute stages, with exception logic and the architected PC at complete.]

32 Global Branch Prediction
So far, the prediction of each static branch instruction is based solely on its own past behavior, not on the behavior of other neighboring static branch instructions.
[Diagram: a Branch History Register (shifted left on update) indexes a Pattern History Table (PHT); on branch resolution the old PHT bits pass through FSM logic to produce the new bits and the prediction.]

33 Two-Level Adaptive Prediction [Yeh & Patt]
Two-level adaptive branch prediction:
- 1st level: history of the last k (dynamic) branches encountered
- 2nd level: branch behavior of the last s occurrences of the specific pattern of these k branches
- Uses a Branch History Register (BHR) in conjunction with a Pattern History Table (PHT)
Example (k=8, s=6):
- The last k branches have some behavior pattern ( )
- The s-bit history at that PHT entry is [101010]
- Using this history, the branch prediction algorithm predicts the direction of the branch (see the sketch below)
Effectiveness:
- Average 97% accuracy for SPEC
- Used in the Intel P6 and AMD K6
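A minimal Python sketch of the two-level idea in its global (GAg-style) form, with a k-bit BHR indexing a PHT of 2-bit saturating counters; the parameters and the branch trace are illustrative only:

class TwoLevelPredictor:
    def __init__(self, k=8):
        self.k = k
        self.bhr = 0                              # k-bit global Branch History Register
        self.pht = [1] * (1 << k)                 # 2-bit counters, start weakly not-taken

    def predict(self):
        return self.pht[self.bhr] >= 2            # counter at the current history pattern

    def update(self, taken):
        c = self.pht[self.bhr]
        self.pht[self.bhr] = min(3, c + 1) if taken else max(0, c - 1)
        self.bhr = ((self.bhr << 1) | int(taken)) & ((1 << self.k) - 1)

p = TwoLevelPredictor(k=8)
for taken in [True, True, False] * 30:            # repeating pattern the history can learn
    p.update(taken)
print("history", format(p.bhr, "08b"), "-> predict taken?", p.predict())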

34 Nomenclature: {G,P}A{g,p,s}
[Diagram: the PC and a Branch History Shift Register (BHSR, shifted left on update) index the Pattern History Table (PHT); the branch result updates the old PHT bits through FSM logic to yield the new bits and the prediction.]
To achieve 97% average prediction accuracy:
- GAg: 1 BHR of 18 bits; 1 PHT of 2^18 x 2 bits; total = 524 Kbits
- PAg: 512x4 BHRs of 12 bits; 1 PHT of 2^12 x 2 bits; total = 33 Kbits
- PAs: 512x4 BHRs of 6 bits; 512 PHTs of 2^6 x 2 bits; total = 78 Kbits

35 Global BHSR Scheme (GAs)
[Diagram: j bits of the branch address are concatenated with the k bits of a single global Branch History Shift Register (BHSR) to index a BHT of 2^(j+k) x 2 bits, which supplies the prediction.]

36 Per-Branch BHSR Scheme (PAs)
[Diagram: i bits of the branch address select one of 2^i per-branch Branch History Shift Registers (k x 2^i bits in total, alongside a standard BHT); the selected k history bits are concatenated with j branch-address bits to index a BHT of 2^(j+k) x 2 bits, which supplies the prediction.]

37 [Gshare-style scheme:] j bits of the branch address are XORed with the k bits of the Branch History Shift Register (BHSR) to index a BHT of 2^max(j,k) x 2 bits, which supplies the prediction.
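A hedged sketch of the index formation for this xor scheme; the field widths and addresses below are illustrative assumptions:

def gshare_index(pc, bhr, index_bits):
    """XOR branch-address bits with history bits to form one shared BHT index."""
    mask = (1 << index_bits) - 1
    return ((pc >> 2) ^ bhr) & mask      # drop the low 2 bits of a word-aligned PC

# The same branch address with different histories maps to different counters,
# which is the point of hashing the history into the index.
print(gshare_index(pc=0x40001234, bhr=0b101100110011, index_bits=12))
print(gshare_index(pc=0x40001234, bhr=0b000011110000, index_bits=12))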

38 Register Data Flow Techniques (ALU Penalty)

39 Inter-instruction Dependences
- Data dependence (Read-after-Write, RAW): r3 <- r1 op r2; r5 <- r3 op r4
- Anti-dependence (Write-after-Read, WAR): r3 <- r1 op r2; r1 <- r4 op r5
- Output dependence (Write-after-Write, WAW): r3 <- r1 op r2; r5 <- r3 op r4; r3 <- r6 op r7
- Control dependence

40 Resolving False Dependences
WAR example: (1) R4 <- R3 + 1, (2) R3 <- R5 + 1: must prevent (2) from completing before (1) is dispatched.
WAW example: (1) R3 <- R3 op R5, (2) R3 <- R5 + 1: must prevent (2) from completing before (1) completes.
- Stalling: delay dispatching (or write-back) of the 2nd instruction
- Copy operands: copy the not-yet-used operand to prevent it being overwritten (WAR)
- Register renaming: use a different register (WAW and WAR)

41 Register Renaming
Anti- and output dependences are false dependences: r3 <- r1 op r2; r5 <- r3 op r4; r3 <- r6 op r7.
The dependence is on the name/location rather than on the data. Given an infinite number of registers, anti- and output dependences can always be eliminated.
Original:          Renamed:
r1 <- r2 / r3      r1 <- r2 / r3
r4 <- r1 * r5      r4 <- r1 * r5
r1 <- r3 + r6      r8 <- r3 + r6
r3 <- r1 - r4      r9 <- r8 - r4
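A minimal Python sketch of renaming (an illustration, not the slide's exact mechanism): every destination write gets a fresh register and sources are read through the current map, which removes the WAR and WAW dependences while preserving the RAW ones:

def rename(instructions, first_free=8):
    # instructions: list of (dst, src1, op, src2) using architected names r0..r7
    mapping = {f"r{i}": f"r{i}" for i in range(first_free)}
    next_free = first_free
    renamed = []
    for dst, src1, op, src2 in instructions:
        s1, s2 = mapping[src1], mapping[src2]        # read sources through the map
        new_dst = f"r{next_free}"                    # allocate a fresh destination
        next_free += 1
        mapping[dst] = new_dst                       # later readers of dst see new_dst
        renamed.append(f"{new_dst} <- {s1} {op} {s2}")
    return renamed

prog = [("r1", "r2", "/", "r3"),
        ("r4", "r1", "*", "r5"),
        ("r1", "r3", "+", "r6"),
        ("r3", "r1", "-", "r4")]
print("\n".join(rename(prog)))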

42 Register Renaming Mechanisms
[Diagram: a register specifier looks up the Architected Register File (ARF) and a map table; if the entry is busy, its tag points into the Rename Register File (RRF), which holds data and valid bits plus pointers for the next entry to be allocated and the next entry to complete; the operand is read from whichever file holds the valid value.]

43 Elements of Modern Micro-dataflow
[Diagram: in-order dispatch from the dispatch buffer allocates reorder buffer entries and reads the register file and rename registers; reservation stations feed branch, integer, integer, floating-point, and load/store units out of order; results are forwarded to the reservation stations and rename registers at register write-back; the completion buffer (reorder buffer) completes in order.]
The reorder buffer is managed as a queue; it maintains the sequential order of all instructions in flight ("takeoff" = dispatching; "landing" = completion). A minimal sketch of that queue discipline follows.
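The sketch below (illustration only) allocates entries at dispatch, marks them finished out of order, and retires only from the head, in order:

from collections import deque

class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()                    # oldest ("takeoff" order) at the left

    def dispatch(self, tag):
        if len(self.entries) == self.size:
            return False                          # ROB full: dispatch must stall
        self.entries.append({"tag": tag, "finished": False})
        return True

    def finish(self, tag):                        # execution finishes out of order
        for e in self.entries:
            if e["tag"] == tag:
                e["finished"] = True

    def complete(self):                           # "landing": retire in order from the head
        retired = []
        while self.entries and self.entries[0]["finished"]:
            retired.append(self.entries.popleft()["tag"])
        return retired

rob = ReorderBuffer(size=4)
for t in ("i1", "i2", "i3"):
    rob.dispatch(t)
rob.finish("i2")
print(rob.complete())     # [] : i1 has not finished, so nothing retires yet
rob.finish("i1")
print(rob.complete())     # ['i1', 'i2'] retire together, in program order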

44 Memory Data Flow Techniques (Load Penalty)

45 Total Ordering of Loads and Stores
- Keep all loads and stores totally in order with respect to each other
- However, loads and stores can execute out of order with respect to other types of instructions (while obeying register data dependences)
- Exception: stores must still be held until all previous instructions complete

46 The "DAXPY" Example
Y[i] = A * X[i] + Y[i]

      LD    F0, a
      ADDI  R4, Rx, #512    ; last address
Loop: LD    F2, 0(Rx)       ; load X[i]
      MULTD F2, F0, F2      ; A*X[i]
      LD    F4, 0(Ry)       ; load Y[i]
      ADDD  F4, F2, F4      ; A*X[i] + Y[i]
      SD    F4, 0(Ry)       ; store into Y[i]
      ADDI  Rx, Rx, #8      ; inc. index to X
      ADDI  Ry, Ry, #8      ; inc. index to Y
      SUB   R20, R4, Rx     ; compute bound
      BNZ   R20, Loop       ; check if done

[Dependence-graph fragment: LD -> MULTD -> ADDD -> SD, with the second LD feeding ADDD.]
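For reference, the same DAXPY computation in plain Python (illustrative, not from the slides):

def daxpy(a, x, y):
    """Y[i] = A * X[i] + Y[i] for all i."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y

print(daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))   # [12.0, 24.0, 36.0]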

47 Processing of Load/Store Instructions
[Diagram: in-order dispatch from the dispatch buffer reads the architected and rename register files; reservation stations feed branch, integer, floating-point, and load/store units out of order. The load/store unit is itself pipelined into address generation, address translation, and memory access stages; completed stores drain in order through the store buffer to data memory, under control of the reorder buffer at complete/retire.]

48 Load/Store Units and Store Buffer
[Diagram: separate store and load units, each with address generation and address translation stages (the load unit also performs the memory access). Finished stores hold their address and data in the (finished) store buffer as speculative state; at completion they move to the (completed) store buffer as committed, in-order state, and from there update the data cache/memory.]

49 Load Bypassing
- Loads can be allowed to bypass older stores if no aliasing is found
  - Older stores' addresses must be computed before loads can be issued, to allow checking for RAW hazards
- Alternatively, a load can assume no aliasing and bypass older stores speculatively
  - Aliasing with previous stores must then be validated, with a mechanism for reversing the effect if it occurs
- Stores are kept in the ROB (or finished store buffer) until all older instructions complete
- At completion time, a store is moved to the completed store buffer to wait for its turn to access the cache
- The store is then considered completed; latency beyond this point has little effect on processor throughput

50 Illustration of Load Bypassing
[Diagram: the load unit's address is checked (tag match) against the addresses in the (finished) and (completed) store buffers while the data cache is accessed; if there is no match, the cache data updates the destination register.]

51 Load Forwarding
- If a pending load is RAW-dependent on an earlier store still in the store buffer, it need not wait until the store is issued to the data cache
- The load can be satisfied directly from the store buffer if both the load and store addresses are valid and the data is available in the store buffer
- This avoids the latency of accessing the data cache
A minimal sketch of the store-buffer check follows.
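The Python sketch below is an illustration (addresses and data are made up): a matching pending store forwards its data to the load, otherwise the load bypasses the older stores and reads the data cache:

def resolve_load(load_addr, store_buffer, cache):
    # store_buffer: pending stores, oldest first, as (address, data) pairs
    for addr, data in reversed(store_buffer):     # youngest matching store wins
        if addr == load_addr:
            return data, "forwarded from store buffer"
    return cache.get(load_addr), "bypassed older stores, read from data cache"

cache = {0x100: 11, 0x200: 22}
store_buffer = [(0x200, 99)]                      # a pending (not yet written) store
print(resolve_load(0x200, store_buffer, cache))   # (99, 'forwarded from store buffer')
print(resolve_load(0x100, store_buffer, cache))   # (11, 'bypassed older stores, ...')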

52 Illustration of Load Forwarding
[Diagram: as in load bypassing, the load address is checked (tag match) against the store buffers; if a match is found, the store's data is forwarded to the destination register instead of the data cache's value.]

53 Speculative Load Bypassing
[Diagram: when a load issues, its address is tag-matched against the store buffers: a match means load forwarding, no match means speculative load bypassing, and the renamed register is updated. Finished loads keep their address and data in a finished load buffer. At store completion, the store's address is tag-matched against that load buffer; the in-order state (architectural registers) is updated, and if an aliased load is found, that load and all younger instructions are flushed.]

54 Dual-Ported Non-Blocking Cache
[Diagram: one store unit and two load units feed a dual-ported data cache backed by the (finished) and (completed) store buffers; cache misses go into a missed-load queue and are serviced from main memory without blocking later accesses.]

55 Prefetching Data Cache
[Diagram: the standard superscalar pipeline (branch predictor, I-cache, decode, dispatch, reservation stations for branch, integer, floating-point, store, and load units, completion and store buffers) is augmented with a memory reference prediction unit and a prefetch queue that bring data into the data cache from main memory ahead of demand.]

56 Current Challenges in Superscalar Design

57 "Iron Law" of Processor Performance
1 / Processor Performance = Wall-Clock Time / Program
  = (Instructions / Program) x (Cycles / Instruction) x (Time / Cycle)
  = instruction count x CPI x cycle time
Equivalently: Processor Performance = (IPC x GHz) / instruction count

58 Frequency and Performance Boost
[Charts of frequency (MHz) and relative performance across process generations (1.0µ to 0.18µ), from the i486 through the Pentium, Pentium II/III, and Pentium 4, split into Freq x uArch and Freq x Process contributions:]
- Frequency increased 50X: 13X due to process technology, an additional 4X due to microarchitecture
- Performance increased >75X: 13X due to process technology, an additional >6X due to microarchitecture
*Note: performance measured using SPECint and SPECfp. Source: Intel Corporation

59 Frequency vs. Parallelism
- Increase frequency (GHz): deeper pipelines, increased overall latency, lower IPC
- Increase instruction parallelism (IPC): wider pipelines, increased complexity, lower GHz
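A toy Python model (all constants are assumptions, not Intel data) of the trade-off these bullets describe: deeper pipelines raise frequency, but the branch misprediction penalty grows with depth and pulls IPC down, so performance eventually flattens and falls:

def relative_performance(depth, logic_delay=20.0, latch_overhead=1.0,
                         branch_fraction=0.2, mispredict_rate=0.05):
    cycle_time = logic_delay / depth + latch_overhead       # shorter cycles, fixed latch cost
    frequency = 1.0 / cycle_time
    cpi = 1.0 + branch_fraction * mispredict_rate * depth   # misprediction penalty ~ depth
    return frequency / cpi                                  # performance ~ frequency x IPC

for d in (5, 10, 20, 40, 80):
    print(f"depth {d:2d}: relative performance {relative_performance(d):.2f}")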

60 Deeper and Wider Pipelines
[Diagram: a shallow pipeline (Fetch, Dec., Disp., Exec., Mem., Retire) versus a deeper, wider one (Fetch, Decode, Dispatch, Execute, Memory, Retire with multiple stages each); the branch mispredict penalty spans from fetch to execute and grows with depth.]

61 Front-End Pipe Depth Penalty
[Diagram: the baseline pipeline (Fetch, Decode, Dispatch, Execute, Memory, Retire) versus one with the front end contracted and an Optimize stage added for back-end optimization.]

62 Alleviate Pipe Depth Penalty
Front-end contraction:
- Code re-mapping and caching
- Trace construction, caching, optimization
- Leverage back-end optimizations
Back-end optimization:
- Multiple-branch, trace, and stream prediction
- Code reordering, alignment, optimization
- Pre-decode, pre-rename, pre-scheduling
- Memory prefetch prediction and control

63 Execution Core Improvement
[Pipeline diagram (Fetch, Decode, Dispatch, Execute, Memory, Retire, Optimize) annotated with:]
- Super-pipelined ALU design, very high-speed arithmetic units
- Speculative out-of-order execution
- Criticality-based data caching
- Aggressive data prefetching

64 How Deep Can You Go?
[Chart: frequency, CPI, performance, and power plotted against pipeline depth (roughly 15 to 57+ stages).] [Ed Grochowski, 7/6/01] Source: Intel Corporation

65 How Much ILP Is There?
Weiss and Smith [1984]: 1.58
Sohi and Vajapeyam [1987]: 1.81
Tjaden and Flynn [1970]: 1.86
Tjaden and Flynn [1973]: 1.96
Uht [1986]: 2.00
Smith et al. [1989]: 2.00
Jouppi and Wall [1988]: 2.40
Johnson [1991]: 2.50
Acosta et al. [1986]: 2.79
Wedig [1982]: 3.00
Butler et al. [1991]: 5.8
Melvin and Patt [1991]: 6
Wall [1991]: 7
Kuck et al. [1972]: 8
Riseman and Foster [1972]: 51
Nicolau and Fisher [1984]: 90

66 Landscape of Microprocessor Families (SPECint)

67 Landscape of Microprocessor Families (SPECint95)
[Scatter plot of SPECint95/MHz versus frequency (MHz) for Intel x86 (Pentium, PPro, PII, PIII), AMD x86 (Athlon), and Alpha processor families.] ** Data source

68 Landscape of Microprocessor Families (SPECint2000)
[Scatter plot of SPECint2000/MHz versus frequency (MHz) for Intel x86 (PIII-Xeon, Pentium 4), AMD x86 (Athlon), Alpha (21264A/B/C), PowerPC, Sparc (Sparc-III), and IPF (Itanium 800) families.] ** Data source

69 Landscape of Microprocessor Families (SPECint)

70 The Pentium® Processor
