Replenishing the Microarchitecture Treasure Chest. CMuART Members

Size: px

Start display at page:

Download "Replenishing the Microarchitecture Treasure Chest. CMuART Members"

Dinah Quinn
6 years ago
Views:

1 Replenishing the Microarchitecture Treasure Chest Prof. John Paul Shen Electrical and Computer Engineering Department University UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 1 CMuART Members Current Ph.D. Students: 1. Bryan Black. Yuan Chou 3. Alex Dean. Ryan Rakvic 5. Bob Rychlik Current M.S. Students: 1. Candice Bechem. Jonathan Combs 3. Jeffrey Heid. Kyle Oppenheim CMuART (PhD) Alumni: 1. Ron Bianchini (FORE & CMU). Mauricio Breternitz (Moto) 3. Trung Diep (Intel). F. Joel Ferguson (UCSC) 5. Andrew Huang (Moto). Mikko Lipasti (IBM & UW) 7. Chris Newburn (Intel). Derek Noonburg (S3) 9. Scott Robinson (Intel) 1. Mike Schuette (Moto) 11. Kent Wilken (UC-Davis) 1. Andy Wolfe (S3) UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page

2 Microprocessor Performance Unmatched by Any Other Industry! [John Crawford, Intel, 1993] Doubling every 1 months (19-199): total of X - Cars travel at, MPH; get 1, miles/gal. - Air travel: L.A. to N.Y. in seconds (MACH ) - Wheat yield:, bushels per acre Doubling every months ( ): total of 9,X - Cars travel at, MPH; get 15, miles/gal. - Air travel: L.A. to N.Y. in seconds (MACH 9,) - Wheat yield: 9, bushels per acre UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 3 Leveraging the Treasure Chest All originally invented in the 19 s - Pipelining - Cache Memories - Multiple Instruction Issue - Out of Order Execution - Dataflow Machines - Vector Machines - Virtual Memory - Optimizing Compilers - Operating Systems UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page

3 Iron Law of Processor Performance Time Processor Performance = Program Instructions Cycles Time = X X Program Instruction Cycle (code size) (CPI) (cycle time) Architecture --> Implementation --> Realization Compiler Designer Processor Designer Chip Designer UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 5 Evolution of Microprocessors by 9 Transistor Count 1K-1K 1K-1M 1M-3M 1,M Clock Frequency.-MHz -MHz -MHz 1GHz Instruction/cycle << (?) MIPS/MFLOPS << , 1, UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page

4 Strong Diminishing Returns on mksim perl Issue Width ijpeg li Issue Width UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 7 Looking for A Paradigm Shift Revisit previously-assumed limits and try to go beyond these limits. 197 s - Flynn s Bottleneck... Branch Prediction 199 s - Dataflow Limit... Value Prediction UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page

5 Once Upon a Time...Fall 1995 Load Value Locality: Value Locality (%) Value Locality (%) 1 cc1-71 cjpeg 1 History=1 History=1 compress doduc eqntott gawk gperfgrep Alpha AXP PowerPC cc1-71 cc1 cjpeg compress doduc eqntott gawk gperfgrep hydrod mpegperl hydrod mpegperl quick sc swm5 tomcatv xlisp quick sc swm5 tomcatv xlisp UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 9 Concept of Value Locality Load Value Locality (%) Register Value Locality (%) History=1 History=1 cc1-71 cjpeg compresseqntott gawk gperf History=1 History= Load Value Locality grep mpeg perl Register Value Locality cc1-71 cjpeg compresseqntott gawk gperf grep mpeg perl UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 1 quick quick sc sc xlisp xlisp

6 Dynamic Pipeline Contraction Branch Prediction Value Prediction Execute Rename Commit Op Read Execute Rename Commit Op Read Execute Rename Commit Op Read Execute Rename Commit Op Read Execute Rename Commit Op Read Execute Rename Commit Op Read Execute Rename Commit Op Read Execute Rename Commit Op Read UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 11 Superspeculation Techniques Dependence Prediction Reg. Value Prediction Load Value Prediction Mem. Alias Prediction FETCH FETCH FETCH FETCH DECODE/ DECODE/ DECODE/ DECODE/ DISPATCH DISPATCH DISPATCH DISPATCH EXEC. EXEC. ADDR. ADDR. COMPL. COMPL. TLB MEM. TLB MEM. COMPL. COMPL. UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 1

7 SPECint95 Performance (1 wide) Sustained AP +LVP +VP +-Phase +GAg TC Baseline Perfect K/Infinite K/ ports K/ ports K/ ports go mksim gcc compress li ijpeg perl vortex HM UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 13 SPECfp95 Performance (1 wide) Sustained Perfect K/Infinite K/ ports K/ ports K/ ports +AP +LVP +VP +-Phase +GAg TC Baseline tomcatv swim mgrid applu apsi fpppp HM Benchmark UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 1

8 li A Possible New Paradigm mksim perl Issue Width Issue Width UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page ijpeg li Hybrid Value Predictors 1% VP1 VP Prediction Rates for VP1 and VP 9% Not Predicted % 7% Incorrect % 5% Both Correct % 3% % 1% Stride+ Unique FCM Unique % compress gcc go ijpeg mksim perl vortex SPECint applu fpppp mgrid swim tomcatv turb3d SPECfp SPEC95 UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 1

9 Prediction Usefulness Usefulness Tracking Im pact on Usefulness Tracking Im pact on Prediction Rate %.5. 9% % 7% Not Predic ted % 5% Incorrect A ll SPECf p SPECint Without Tracking W ith Trac king % 3% % 1% % A ll SPECf p SPECint A ll SPECf p SPECint Without Tracking With Tracking Correct UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 17 Current Value Prediction Landscape Increasing Sophistication - Hybrid Predictors - Adaptive Predictors Increasing Efficiency - Selective Prediction - Register-based Prediction - Compiler Assistance Extensions Into Other Domains - VLIW-based Value Prediction - Dynamic Instruction Reuse - Aggressive Partial Evaluation UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 1

10 Instruction Supply Problem Branch Prediction T-cache FETCH I-cache Instruction Buffer Instruction Flow DECODE/ Integer Floating-point Media Memory Memory Data Flow Register Data Flow Reorder Buffer Store Buffer COMPLETE D-cache UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 19 Wide-Machine Instruction Three Major Challenges: 1. Multiple-Branch Prediction I-cache. Multiple Groups 3. Alignment and Collapsing Branch Prediction FETCH DECODE/ Instruction Buffer DISPATCH UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page

11 High-bandwidth Instruction Time Enhanced Instruction Cache Multi-branch prediction Multi-ported instruction cache fetch Instructions align & collapse Trace Cache Next trace prediction Trace cache fetch Execution Core Execution Core Completion Time Branch predictor update Instructions align & collapse Trace construction Trace cache fill Trace predictor update UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 1 Conventional Trace Cache Next Trace Pred. trace_id Trace Cache I-Cache History Hash Fill Unit br. hist. Buffer Execution Core Completion UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page

12 The Block-Based Trace Cache Trace Table Rename Table trace_id History Hash block_ids p re-collap se Fill Unit br. hist. Block Cache Final Collap se Comp letion Buffer Execution Core I-Cache UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 3 Next Trace Prediction b_id b_id1 b_id b_id3 branch history Hash Function Next trace_id tag index v tag Trace Table 1... w block_ids... Hit = w p red. block_ids to the block cache UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page

13 Replicated Block Cache N = n word lines Instructions from the block fill unit block_id (n -bit) decoder FA i1 i ib... direct mapped cache copy-1 Block Cache b inst copy- copy-3 copy- Final Collapse 1 Buffer UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 5 Block Rename Table Block fetch address Tag Index v tag block_id N= entries = = Block fetch address renamed to a block_id UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page

14 Block Cache Fragmentation Avg. # of inst. in fetch buffer ,,,3, 1,1 comp gcc go ijpeg li mk perl vort h-m Block size (b), Block fetch width (w) UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 7 Block Cache Replication Avg. # of inst. in fetch buffer 1 1 (b=, N=9) w= w=3 w= w=5 w=1 comp gcc go ijpeg li mk perl vort benchmark UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page

15 Trace Table Size Hit rate (%) k k k comp gcc go ijpeg li mk perl vort benchmark UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 9 Block Cache Hit Rate 1% % % % % % Number of blocks fetched hit miss no pred comp gcc go ijpeg li mk perl vort UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 3

16 Performance vs. Block Cache Size compress k k k k 1k Block cache entries (N) gcc k k k k 1k Block cache entries (N) go k k k k 1k Block cache entries (N) ijpeg k k k k 1k Block cache entries (N) UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 31 Performance vs. Block Cache Size li k k k k 1k Block cache entries (N) mksim k k k k 1k Block cache entries (N) perl k k k k 1k Block cache entries (N) vortex k k k k 1k Block cache entries (N) UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 3

17 Aggregate vs. Block Cache Size harmonic mean 1 1 block cache data dep endencies/instruction window lim it 1 mispredictions 1 perfect fetch block cache capacity misses & fragmentation perfect branch predict taken branch boundary baseline branch predict k k k k 1k branch misp rediction perfect block predict Block cache entries (N) real block predict UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 33 Conventional vs. Block-Based compress 1 1 block based conventional bytes 1 1 block based conventional 1 1 gcc gcc bytes block based conventional bytes 1 go 1 ijpeg 1 1 block based conventional bytes block based conventional bytes UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 3

18 Conventional vs. Block-Based 1 li 1 mksim 1 1 block based conventional block based conventional bytes bytes 1 perl 1 vortex 1 1 block based conventional bytes block based conventional bytes UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 35 Aggregate Comparison 1 harmonic mean 1 block based conventional bytes UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 3

19 Where Do We Go From Here... New Forcing Functions: High Frequency Instruction Level Parallelism Simultaneously optimize and MHz Silicon and Time Efficient Performance (STEP) Increase both /die_area and δ()/δ(year) Dynamic/Static and Hardware/Software fusion Reconcile generational latencies of H/W and S/W Bridge compiled object code & machine executable UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 37 How to have your cake and eat it too SPECint95 SPECint Average.5 Pipe Depth 3 MHz MHz 1 Pipe Width UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 3

20 Journey of Many Small Steps Improvement Major microarchitecture generations Need to Increase δ()/δ(year) UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page 39 Get the Best of Both Worlds Microarchitecture generational latency: -5 years Compiler generational latency: -1 years (?) Compiler Object Code Code Transformation Executable Image Execution Core UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999 Page

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs October 29, 2002 Microprocessor Research Forum Intel s Microarchitecture Research Labs! USA: