John P. Shen Microprocessor Research Intel Labs March 19, 2002


1 CS Microarchitecture: Superscalar Processor Design. John P. Shen, Microprocessor Research, Intel Labs, March 19, 2002

2 Moore's Law Continues...
[Four trend charts for Intel processors (486, Pentium, Pentium Pro/P6):]
- Transistors (MT): transistors double every two years (2X growth in 1.96 years!)
- Die size (mm): die size grows 14% in two years (~7% growth per year, ~2X growth in 10 years)
- Frequency (MHz): frequency doubles in two years
- Power (Watts): power grows exponentially
Source: Intel Corporation

3 ...For At Least Another Decade
[Projected trend charts, extrapolated from the 486, Pentium, Pentium Pro, and Pentium III:]
- Transistors (MT): heading toward ~2B transistors (425M along the way)
- Die size (mm): ~40mm die, ~7% growth per year, ~2X growth in 10 years
- Frequency (MHz): 3 GHz, 6.5 GHz, 14 GHz, toward ~30 GHz
- Power (Watts): power too high
Source: Intel Corporation

4 Microprocessor Performance

5 Evolution of Microprocessors
Transistor count:    2K-100K | 100K-1M | 1M-100M | 100M-1B
Clock frequency:     0.1-3 MHz | 3-30 MHz | 30 MHz-1 GHz | 1-15 GHz
Instructions/cycle:  < (?)

6 Performance Growth in Perspective
Doubling every 18 months ( ): total of 3,200X
- Cars travel at 176,000 MPH; get 64,000 miles/gal.
- Air travel: L.A. to N.Y. in 5.5 seconds (MACH 3200)
- Wheat yield: 320,000 bushels per acre
Doubling every 24 months ( ): total of 36,000X
- Cars travel at 2,400,000 MPH; get 600,000 miles/gal.
- Air travel: L.A. to N.Y. in 0.5 seconds (MACH 36,000)
- Wheat yield: 3,600,000 bushels per acre
Unmatched by any other industry!! [John Crawford, Intel, 1993]

7 "Iron Law" of Processor Performance
1 / Processor Performance = Wall-Clock Time / Program
  = (Instructions / Program) x (Cycles / Instruction) x (Time / Cycle)
  = instruction count x CPI x cycle time
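To make the identity concrete, here is a minimal Python sketch (not from the slides; the workload numbers are made-up illustrative values):

# Iron Law: wall-clock time = instruction count x CPI x cycle time.
def wall_clock_time(instr_count, cpi, cycle_time_s):
    return instr_count * cpi * cycle_time_s

# Illustrative values only: 1e9 instructions, CPI = 1.5, 1 GHz clock.
t = wall_clock_time(1e9, 1.5, 1e-9)
print(f"execution time = {t:.2f} s, performance = {1/t:.3f} programs/s")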

8 Review of Scalar Pipelined Processors

9 Scalar Pipelined Processors
The 6-stage TYPICAL pipeline (IF, ID, OF/RD, EX/ALU, MEM, OS/WB) and what each instruction type (ALU, LOAD, STORE, BRANCH) does per stage:
- IF (1): access I-cache with the PC (all types)
- ID (2): decode (all types)
- OF/RD (3): read registers (all types)
- EX/ALU (4): ALU op (ALU); address generation (LOAD, STORE, BRANCH)
- MEM (5): read memory (LOAD); write memory (STORE)
- OS/WB (6): write register (ALU, LOAD); write PC (BRANCH)

10 Inter-instruction Dependences
- Data dependence (Read-after-Write, RAW): r3 <- r1 op r2; r5 <- r3 op r4
- Anti-dependence (Write-after-Read, WAR): r3 <- r1 op r2; r1 <- r4 op r5
- Output dependence (Write-after-Write, WAW): r3 <- r1 op r2; r5 <- r3 op r4; r3 <- r6 op r7
- Control dependence
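A small Python sketch (an illustration, not part of the slides) that classifies the dependence between an earlier and a later instruction, each described by its destination register and set of source registers:

def classify(earlier, later):
    # earlier/later: (destination register, set of source registers)
    dst1, srcs1 = earlier
    dst2, srcs2 = later
    deps = []
    if dst1 in srcs2:
        deps.append("RAW (true data dependence)")
    if dst2 in srcs1:
        deps.append("WAR (anti-dependence)")
    if dst1 == dst2:
        deps.append("WAW (output dependence)")
    return deps or ["independent"]

# r3 <- r1 op r2 followed by r5 <- r3 op r4: RAW through r3
print(classify(("r3", {"r1", "r2"}), ("r5", {"r3", "r4"})))
# r3 <- r1 op r2 followed by r1 <- r4 op r5: WAR through r1
print(classify(("r3", {"r1", "r2"}), ("r1", {"r4", "r5"})))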

11 ALU Interlock and Penalty
[Pipeline diagram (IF, ID, RD, ALU, MEM, WB) with ALU forwarding paths a, b, c reaching dependent instructions at distance 1, 2, and 3:]
- (i to i+1): forwarding via path a
- (i to i+2): forwarding via path b
- (i to i+3): i writes R1 before i+3 reads R1

12 Load Interlock and Penalty
[Pipeline diagram (IF, ID, RD, ALU, MEM, WB) with load forwarding path(s) d, e from the memory stages to dependent instructions at distance 1, 2, and 3:]
- (i to i+1): stall i+1
- (i to i+1): forwarding via path d
- (i to i+2): i writes R1 before i+2 reads R1

13 Major Penalty Loops of Pipelining
[Pipeline diagram (IF, ID, RD, ALU, MEM, WB) with three feedback loops:] 1. branch penalty, 2. ALU penalty, 3. load penalty.
Performance objective: reduce CPI to 1.

14 Introduction to Modern Superscalar Processors

15 Limitations of Scalar Pipelines
- Upper bound on scalar pipeline throughput: limited by IPC = 1
- Inefficient unification into a single pipeline: long latency for each instruction
- Performance lost due to a rigid pipeline: unnecessary stalls

16 Scalar to Superscalar Pipelines
- Parallel pipeline: wide pipelines; advance multiple instructions per cycle
- Diversified pipeline: multiple functional unit types; mix of different functional units
- Dynamic pipeline: out-of-order execution; distributed functional units

17 A Modern Superscalar Processor
[Pipeline diagram:] Fetch -> instruction/decode buffer -> Decode -> dispatch buffer -> Dispatch -> reservation stations -> Issue/Execute -> Finish -> reorder/completion buffer -> Complete -> store buffer -> Retire.
Fetch through dispatch is in order, issue/execute/finish is out of order, and complete/retire is in order again.

18 Flow Paths of Superscalars
[Diagram: FETCH (I-cache, branch predictor, instruction buffer), DECODE, EXECUTE (integer, floating-point, media, memory units with the reorder buffer (ROB)), COMMIT (store queue, D-cache).]
Three flow paths: instruction flow, register data flow, and memory data flow.

19 Instruction Flow Techniques (Branch Penalty)

20 What's So Bad About Branches?
[Superscalar pipeline diagram (Fetch, Decode, Dispatch, reservation stations including a branch unit, Execute, Finish, Complete, Retire): branches resolve late in the pipeline, far from fetch.]

21 Riseman and Foster's Study
- 7 benchmark programs on the CDC-3600
- Assume an infinite machine: infinite memory and instruction stack, register file, functional units
- Consider only true dependences, at the data-flow limit
- If bounded to a single basic block (i.e., no bypassing of branches), the maximum speedup is 1.72
- Suppose one can bypass conditional branches and jumps (i.e., assume the actual branch path is always known, so branches do not impede instruction execution)
Br. Bypassed: Max Speedup:

22 Branch Prediction
Target address generation (target speculation)
- Access register: PC, GP register, link register
- Perform calculation: +/- offset, auto-incrementing/decrementing
Condition resolution (condition speculation)
- Access register: condition code register, data register, count register
- Perform calculation: comparison of data register(s)

23 Target Address Generation
[Pipeline diagram showing where the branch target becomes available: PC-relative targets at decode, register-indirect targets at dispatch, and register-indirect-with-offset targets at execute, each fed back to fetch.]

24 Condition Resolution
[Pipeline diagram showing where the branch condition becomes available: the condition-code register at dispatch, and GP-register value comparison at execute, each fed back to fetch.]

25 Branch Instruction Speculation
[Diagram: a branch predictor (using a BTB) sits beside the fetch stage. The FA-mux selects the next PC sent to the I-cache from either the sequential address npc = PC+4 or the speculative target; the predictor supplies a speculative condition prediction and speculative target, and the BTB is updated (target address and history) when the branch resolves at execute.]
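The following Python sketch shows the idea of target speculation with a BTB. It is a toy, fully-associative dictionary version (real BTBs are set-associative and tagged), and the addresses are made up:

class BTB:
    def __init__(self):
        self.table = {}                      # branch PC -> predicted taken-target

    def predict(self, pc):
        # Hit: speculate to the cached target; miss: next sequential address.
        return self.table.get(pc, pc + 4)

    def update(self, pc, target, taken):
        if taken:
            self.table[pc] = target          # install/refresh target of a taken branch
        else:
            self.table.pop(pc, None)         # drop entries for not-taken branches

btb = BTB()
btb.update(0x400, 0x500, taken=True)
print(hex(btb.predict(0x400)))               # 0x500 (speculative target)
print(hex(btb.predict(0x404)))               # 0x408 (sequential PC+4)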

26 Example Prediction Algorithm
Prediction accuracy approaches its maximum with as few as 2 preceding branch occurrences used as history.
[State-machine diagram: the last two branch outcomes (TT, TN, NT, NN) determine the next prediction.]
Results (%) reported for the IBM1, IBM2, IBM3, IBM4, DEC, and CDC benchmarks [IBM RS/6000 Study, Nair, 1992].

27 Other Prediction Algorithms
[State diagrams: a 2-bit saturation counter and a 2-bit hysteresis counter, each with strong/weak taken (T, t) and not-taken (N, n) states.]
Combining prediction accuracy with BTB hit rate (86.5% for 128 sets of 4 entries each), branch prediction can provide a net prediction accuracy of approximately 80%. This implies a 5-20% performance enhancement.
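A minimal Python model of the 2-bit saturation counter shown above (the branch outcome sequence is made up):

class SaturatingCounter:
    """2-bit saturating counter: states 0..3, predict taken when state >= 2."""
    def __init__(self, state=2):              # start weakly taken
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

ctr = SaturatingCounter()
outcomes = [True, True, False, True, True, True, False, True]   # made-up history
correct = 0
for taken in outcomes:
    correct += (ctr.predict() == taken)
    ctr.update(taken)
print(f"{correct}/{len(outcomes)} predictions correct")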

28 Optimal Predictor by Exhaustive Search
- There are 2^20 possible state machines of 2-bit predictors
- Pruning uninteresting and redundant machines leaves 5248
- It is possible to exhaustively search and find the optimal predictor for a benchmark
Benchmark and best-predictor accuracy (%): spice2g6, doduc 94.3, gcc 89.1, espresso 89.1, li 87.1, eqntott 87.9
The saturation counter is near optimal in all cases!

29 Number of Counter Bits Needed
Prediction accuracy (overall CPI overhead) by counter width:
Benchmark   3-bit          2-bit          1-bit          0-bit
spice2g6    (0.009)        97.0 (0.009)   96.2 (0.013)   76.6 (0.031)
doduc       94.2 (0.003)   94.3 (0.003)   90.2 (0.004)   69.2 (0.022)
gcc         89.7 (0.025)   89.1 (0.026)   86.0 (0.033)   50.0 (0.128)
espresso    89.5 (0.045)   89.1 (0.047)   87.2 (0.054)   58.5 (0.176)
li          88.3 (0.042)   86.8 (0.048)   82.5 (0.063)   62.4 (0.142)
eqntott     89.3 (0.028)   87.2 (0.033)   82.9 (0.046)   78.4 (0.049)
Branch history table size: a direct-mapped array of 2^k entries. Programs like gcc can have over 7,000 conditional branches; on collisions, multiple branches share the same predictor. The variation of branch penalty with branch history table size levels out at

30 PPC 604: BHT and BTAC
[Diagram: the fetch PC accesses the I-cache, the Branch History Table (BHT), and the Branch Target Address Cache (BTAC) in parallel; decode, dispatch, and the reservation stations (BRN, SFX, SFX, CFX, FPU, LS) follow, with feedback from branch execution to the BHT/BTAC and the completion buffer.]

31 PPC 604 Fetch Address Generation
[Diagram: the fetch address register (FAR) is selected among the sequential address, the BTAC target, and per-stage prediction logic (4 instructions wide) at the fetch, decode, and dispatch/branch-execute stages, with exception logic and the architected PC at complete.]

32 Global Branch Prediction
So far, the prediction of each static branch instruction is based solely on its own past behavior, not on the behavior of other neighboring static branch instructions.
[Diagram: a Branch History Register (shifted left on update) indexes a Pattern History Table (PHT); on branch resolution the old PHT bits pass through FSM logic to produce the new bits and the prediction.]

33 Two-Level Adaptive Prediction [Yeh & Patt]
Two-level adaptive branch prediction:
- 1st level: history of the last k (dynamic) branches encountered
- 2nd level: branch behavior of the last s occurrences of the specific pattern of these k branches
- Uses a Branch History Register (BHR) in conjunction with a Pattern History Table (PHT)
Example (k=8, s=6):
- The last k branches have some behavior pattern ( )
- The s-bit history at that PHT entry is [101010]
- Using this history, the branch prediction algorithm predicts the direction of the branch (see the sketch below)
Effectiveness:
- Average 97% accuracy for SPEC
- Used in the Intel P6 and AMD K6
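A minimal Python sketch of the two-level idea in its global (GAg-style) form, with a k-bit BHR indexing a PHT of 2-bit saturating counters; the parameters and the branch trace are illustrative only:

class TwoLevelPredictor:
    def __init__(self, k=8):
        self.k = k
        self.bhr = 0                              # k-bit global Branch History Register
        self.pht = [1] * (1 << k)                 # 2-bit counters, start weakly not-taken

    def predict(self):
        return self.pht[self.bhr] >= 2            # counter at the current history pattern

    def update(self, taken):
        c = self.pht[self.bhr]
        self.pht[self.bhr] = min(3, c + 1) if taken else max(0, c - 1)
        self.bhr = ((self.bhr << 1) | int(taken)) & ((1 << self.k) - 1)

p = TwoLevelPredictor(k=8)
for taken in [True, True, False] * 30:            # repeating pattern the history can learn
    p.update(taken)
print("history", format(p.bhr, "08b"), "-> predict taken?", p.predict())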

34 Nomenclature: {G,P}A{g,p,s}
[Diagram: the PC and a Branch History Shift Register (BHSR, shifted left on update) index the Pattern History Table (PHT); the branch result updates the old PHT bits through FSM logic to yield the new bits and the prediction.]
To achieve 97% average prediction accuracy:
- GAg: 1 BHR of 18 bits; 1 PHT of 2^18 x 2 bits; total = 524 Kbits
- PAg: 512x4 BHRs of 12 bits; 1 PHT of 2^12 x 2 bits; total = 33 Kbits
- PAs: 512x4 BHRs of 6 bits; 512 PHTs of 2^6 x 2 bits; total = 78 Kbits

35 Global BHSR Scheme (GAs)
[Diagram: j bits of the branch address are concatenated with the k bits of a single global Branch History Shift Register (BHSR) to index a BHT of 2^(j+k) x 2 bits, which supplies the prediction.]

36 Per-Branch BHSR Scheme (PAs)
[Diagram: i bits of the branch address select one of 2^i per-branch Branch History Shift Registers (k x 2^i bits in total, alongside a standard BHT); the selected k history bits are concatenated with j branch-address bits to index a BHT of 2^(j+k) x 2 bits, which supplies the prediction.]

37 [Gshare-style scheme:] j bits of the branch address are XORed with the k bits of the Branch History Shift Register (BHSR) to index a BHT of 2^max(j,k) x 2 bits, which supplies the prediction.
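A hedged sketch of the index formation for this xor scheme; the field widths and addresses below are illustrative assumptions:

def gshare_index(pc, bhr, index_bits):
    """XOR branch-address bits with history bits to form one shared BHT index."""
    mask = (1 << index_bits) - 1
    return ((pc >> 2) ^ bhr) & mask      # drop the low 2 bits of a word-aligned PC

# The same branch address with different histories maps to different counters,
# which is the point of hashing the history into the index.
print(gshare_index(pc=0x40001234, bhr=0b101100110011, index_bits=12))
print(gshare_index(pc=0x40001234, bhr=0b000011110000, index_bits=12))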

38 Register Data Flow Techniques (ALU Penalty)

39 Inter-instruction Dependences
- Data dependence (Read-after-Write, RAW): r3 <- r1 op r2; r5 <- r3 op r4
- Anti-dependence (Write-after-Read, WAR): r3 <- r1 op r2; r1 <- r4 op r5
- Output dependence (Write-after-Write, WAW): r3 <- r1 op r2; r5 <- r3 op r4; r3 <- r6 op r7
- Control dependence

40 Resolving False Dependences
WAR example: (1) R4 <- R3 + 1, (2) R3 <- R5 + 1: must prevent (2) from completing before (1) is dispatched.
WAW example: (1) R3 <- R3 op R5, (2) R3 <- R5 + 1: must prevent (2) from completing before (1) completes.
- Stalling: delay dispatching (or write-back) of the 2nd instruction
- Copy operands: copy the not-yet-used operand to prevent it being overwritten (WAR)
- Register renaming: use a different register (WAW and WAR)

41 Register Renaming
Anti- and output dependences are false dependences: r3 <- r1 op r2; r5 <- r3 op r4; r3 <- r6 op r7.
The dependence is on the name/location rather than on the data. Given an infinite number of registers, anti- and output dependences can always be eliminated.
Original:          Renamed:
r1 <- r2 / r3      r1 <- r2 / r3
r4 <- r1 * r5      r4 <- r1 * r5
r1 <- r3 + r6      r8 <- r3 + r6
r3 <- r1 - r4      r9 <- r8 - r4
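A minimal Python sketch of renaming (an illustration, not the slide's exact mechanism): every destination write gets a fresh register and sources are read through the current map, which removes the WAR and WAW dependences while preserving the RAW ones:

def rename(instructions, first_free=8):
    # instructions: list of (dst, src1, op, src2) using architected names r0..r7
    mapping = {f"r{i}": f"r{i}" for i in range(first_free)}
    next_free = first_free
    renamed = []
    for dst, src1, op, src2 in instructions:
        s1, s2 = mapping[src1], mapping[src2]        # read sources through the map
        new_dst = f"r{next_free}"                    # allocate a fresh destination
        next_free += 1
        mapping[dst] = new_dst                       # later readers of dst see new_dst
        renamed.append(f"{new_dst} <- {s1} {op} {s2}")
    return renamed

prog = [("r1", "r2", "/", "r3"),
        ("r4", "r1", "*", "r5"),
        ("r1", "r3", "+", "r6"),
        ("r3", "r1", "-", "r4")]
print("\n".join(rename(prog)))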

42 Register Renaming Mechanisms
[Diagram: a register specifier looks up the Architected Register File (ARF) and a map table; if the entry is busy, its tag points into the Rename Register File (RRF), which holds data and valid bits plus pointers for the next entry to be allocated and the next entry to complete; the operand is read from whichever file holds the valid value.]

43 Elements of Modern Micro-dataflow
[Diagram: in-order dispatch from the dispatch buffer allocates reorder buffer entries and reads the register file and rename registers; reservation stations feed branch, integer, integer, floating-point, and load/store units out of order; results are forwarded to the reservation stations and rename registers at register write-back; the completion buffer (reorder buffer) completes in order.]
The reorder buffer is managed as a queue; it maintains the sequential order of all instructions in flight ("takeoff" = dispatching; "landing" = completion). A minimal sketch of that queue discipline follows.
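The sketch below (illustration only) allocates entries at dispatch, marks them finished out of order, and retires only from the head, in order:

from collections import deque

class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()                    # oldest ("takeoff" order) at the left

    def dispatch(self, tag):
        if len(self.entries) == self.size:
            return False                          # ROB full: dispatch must stall
        self.entries.append({"tag": tag, "finished": False})
        return True

    def finish(self, tag):                        # execution finishes out of order
        for e in self.entries:
            if e["tag"] == tag:
                e["finished"] = True

    def complete(self):                           # "landing": retire in order from the head
        retired = []
        while self.entries and self.entries[0]["finished"]:
            retired.append(self.entries.popleft()["tag"])
        return retired

rob = ReorderBuffer(size=4)
for t in ("i1", "i2", "i3"):
    rob.dispatch(t)
rob.finish("i2")
print(rob.complete())     # [] : i1 has not finished, so nothing retires yet
rob.finish("i1")
print(rob.complete())     # ['i1', 'i2'] retire together, in program order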

44 Memory Data Flow Techniques (Load Penalty)

45 Total Ordering of Loads and Stores
- Keep all loads and stores totally in order with respect to each other
- However, loads and stores can execute out of order with respect to other types of instructions (while obeying register data dependences)
- Exception: stores must still be held until all previous instructions complete

46 The "DAXPY" Example
Y[i] = A * X[i] + Y[i]

      LD    F0, a
      ADDI  R4, Rx, #512    ; last address
Loop: LD    F2, 0(Rx)       ; load X[i]
      MULTD F2, F0, F2      ; A*X[i]
      LD    F4, 0(Ry)       ; load Y[i]
      ADDD  F4, F2, F4      ; A*X[i] + Y[i]
      SD    F4, 0(Ry)       ; store into Y[i]
      ADDI  Rx, Rx, #8      ; inc. index to X
      ADDI  Ry, Ry, #8      ; inc. index to Y
      SUB   R20, R4, Rx     ; compute bound
      BNZ   R20, Loop       ; check if done

[Dependence-graph fragment: LD -> MULTD -> ADDD -> SD, with the second LD feeding ADDD.]
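For reference, the same DAXPY computation in plain Python (illustrative, not from the slides):

def daxpy(a, x, y):
    """Y[i] = A * X[i] + Y[i] for all i."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y

print(daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))   # [12.0, 24.0, 36.0]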

47 Processing of Load/Store Instructions
[Diagram: in-order dispatch from the dispatch buffer reads the architected and rename register files; reservation stations feed branch, integer, floating-point, and load/store units out of order. The load/store unit is itself pipelined into address generation, address translation, and memory access stages; completed stores drain in order through the store buffer to data memory, under control of the reorder buffer at complete/retire.]

48 Load/Store Units and Store Buffer
[Diagram: separate store and load units, each with address generation and address translation stages (the load unit also performs the memory access). Finished stores hold their address and data in the (finished) store buffer as speculative state; at completion they move to the (completed) store buffer as committed, in-order state, and from there update the data cache/memory.]

49 Load Bypassing
- Loads can be allowed to bypass older stores if no aliasing is found
  - Older stores' addresses must be computed before loads can be issued, to allow checking for RAW hazards
- Alternatively, a load can assume no aliasing and bypass older stores speculatively
  - Aliasing with previous stores must then be validated, with a mechanism for reversing the effect if it occurs
- Stores are kept in the ROB (or finished store buffer) until all older instructions complete
- At completion time, a store is moved to the completed store buffer to wait for its turn to access the cache
- The store is then considered completed; latency beyond this point has little effect on processor throughput

50 Illustration of Load Bypassing
[Diagram: the load unit's address is checked (tag match) against the addresses in the (finished) and (completed) store buffers while the data cache is accessed; if there is no match, the cache data updates the destination register.]

51 Load Forwarding
- If a pending load is RAW-dependent on an earlier store still in the store buffer, it need not wait until the store is issued to the data cache
- The load can be satisfied directly from the store buffer if both the load and store addresses are valid and the data is available in the store buffer
- This avoids the latency of accessing the data cache
A minimal sketch of the store-buffer check follows.
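The Python sketch below is an illustration (addresses and data are made up): a matching pending store forwards its data to the load, otherwise the load bypasses the older stores and reads the data cache:

def resolve_load(load_addr, store_buffer, cache):
    # store_buffer: pending stores, oldest first, as (address, data) pairs
    for addr, data in reversed(store_buffer):     # youngest matching store wins
        if addr == load_addr:
            return data, "forwarded from store buffer"
    return cache.get(load_addr), "bypassed older stores, read from data cache"

cache = {0x100: 11, 0x200: 22}
store_buffer = [(0x200, 99)]                      # a pending (not yet written) store
print(resolve_load(0x200, store_buffer, cache))   # (99, 'forwarded from store buffer')
print(resolve_load(0x100, store_buffer, cache))   # (11, 'bypassed older stores, ...')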

52 Illustration of Load Forwarding
[Diagram: as in load bypassing, the load address is checked (tag match) against the store buffers; if a match is found, the store's data is forwarded to the destination register instead of the data cache's value.]

53 Speculative Load Bypassing
[Diagram: when a load issues, its address is tag-matched against the store buffers: a match means load forwarding, no match means speculative load bypassing, and the renamed register is updated. Finished loads keep their address and data in a finished load buffer. At store completion, the store's address is tag-matched against that load buffer; the in-order state (architectural registers) is updated, and if an aliased load is found, that load and all younger instructions are flushed.]

54 Dual-Ported Non-Blocking Cache
[Diagram: one store unit and two load units feed a dual-ported data cache backed by the (finished) and (completed) store buffers; cache misses go into a missed-load queue and are serviced from main memory without blocking later accesses.]

55 Prefetching Data Cache
[Diagram: the standard superscalar pipeline (branch predictor, I-cache, decode, dispatch, reservation stations for branch, integer, floating-point, store, and load units, completion and store buffers) is augmented with a memory reference prediction unit and a prefetch queue that bring data into the data cache from main memory ahead of demand.]

56 Current Challenges in Superscalar Design

57 "Iron Law" of Processor Performance
1 / Processor Performance = Wall-Clock Time / Program
  = (Instructions / Program) x (Cycles / Instruction) x (Time / Cycle)
  = instruction count x CPI x cycle time
Equivalently: Processor Performance = (IPC x GHz) / instruction count

58 Frequency and Performance Boost
[Charts of frequency (MHz) and relative performance across process generations (1.0µ to 0.18µ), from the i486 through the Pentium, Pentium II/III, and Pentium 4, split into Freq x uArch and Freq x Process contributions:]
- Frequency increased 50X: 13X due to process technology, an additional 4X due to microarchitecture
- Performance increased >75X: 13X due to process technology, an additional >6X due to microarchitecture
*Note: performance measured using SPECint and SPECfp. Source: Intel Corporation

59 Frequency vs. Parallelism
- Increase frequency (GHz): deeper pipelines, increased overall latency, lower IPC
- Increase instruction parallelism (IPC): wider pipelines, increased complexity, lower GHz
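A toy Python model (all constants are assumptions, not Intel data) of the trade-off these bullets describe: deeper pipelines raise frequency, but the branch misprediction penalty grows with depth and pulls IPC down, so performance eventually flattens and falls:

def relative_performance(depth, logic_delay=20.0, latch_overhead=1.0,
                         branch_fraction=0.2, mispredict_rate=0.05):
    cycle_time = logic_delay / depth + latch_overhead       # shorter cycles, fixed latch cost
    frequency = 1.0 / cycle_time
    cpi = 1.0 + branch_fraction * mispredict_rate * depth   # misprediction penalty ~ depth
    return frequency / cpi                                  # performance ~ frequency x IPC

for d in (5, 10, 20, 40, 80):
    print(f"depth {d:2d}: relative performance {relative_performance(d):.2f}")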

60 Deeper and Wider Pipelines
[Diagram: a shallow pipeline (Fetch, Dec., Disp., Exec., Mem., Retire) versus a deeper, wider one (Fetch, Decode, Dispatch, Execute, Memory, Retire with multiple stages each); the branch mispredict penalty spans from fetch to execute and grows with depth.]

61 Front-End Pipe Depth Penalty
[Diagram: the baseline pipeline (Fetch, Decode, Dispatch, Execute, Memory, Retire) versus one with the front end contracted and an Optimize stage added for back-end optimization.]

62 Alleviate Pipe Depth Penalty
Front-end contraction:
- Code re-mapping and caching
- Trace construction, caching, optimization
- Leverage back-end optimizations
Back-end optimization:
- Multiple-branch, trace, and stream prediction
- Code reordering, alignment, optimization
- Pre-decode, pre-rename, pre-scheduling
- Memory prefetch prediction and control

63 Execution Core Improvement
[Pipeline diagram (Fetch, Decode, Dispatch, Execute, Memory, Retire, Optimize) annotated with:]
- Super-pipelined ALU design, very high-speed arithmetic units
- Speculative out-of-order execution
- Criticality-based data caching
- Aggressive data prefetching

64 How Deep Can You Go?
[Chart: frequency, CPI, performance, and power plotted against pipeline depth (roughly 15 to 57+ stages).] [Ed Grochowski, 7/6/01] Source: Intel Corporation

65 How Much ILP Is There?
Weiss and Smith [1984]: 1.58
Sohi and Vajapeyam [1987]: 1.81
Tjaden and Flynn [1970]: 1.86
Tjaden and Flynn [1973]: 1.96
Uht [1986]: 2.00
Smith et al. [1989]: 2.00
Jouppi and Wall [1988]: 2.40
Johnson [1991]: 2.50
Acosta et al. [1986]: 2.79
Wedig [1982]: 3.00
Butler et al. [1991]: 5.8
Melvin and Patt [1991]: 6
Wall [1991]: 7
Kuck et al. [1972]: 8
Riseman and Foster [1972]: 51
Nicolau and Fisher [1984]: 90

66 Landscape of Microprocessor Families (SPECint)

67 Landscape of Microprocessor Families (SPECint95)
[Scatter plot of SPECint95/MHz versus frequency (MHz) for Intel x86 (Pentium, PPro, PII, PIII), AMD x86 (Athlon), and Alpha processor families.] ** Data source

68 Landscape of Microprocessor Families (SPECint2000)
[Scatter plot of SPECint2000/MHz versus frequency (MHz) for Intel x86 (PIII-Xeon, Pentium 4), AMD x86 (Athlon), Alpha (21264A/B/C), PowerPC, Sparc (Sparc-III), and IPF (Itanium 800) families.] ** Data source

69 Landscape of Microprocessor Families (SPECint)

70 The Pentium® Processor
