Replenishing the Microarchitecture Treasure Chest. CMuART Members
Transcription
Replenishing the Microarchitecture Treasure Chest
Prof. John Paul Shen
Electrical and Computer Engineering Department University
UT Austin -- Distinguished Lecture Series on Computer Architecture -- April, 1999

CMuART Members

Current Ph.D. Students:
1. Bryan Black
2. Yuan Chou
3. Alex Dean
4. Ryan Rakvic
5. Bob Rychlik

Current M.S. Students:
1. Candice Bechem
2. Jonathan Combs
3. Jeffrey Heid
4. Kyle Oppenheim

CMuART (PhD) Alumni:
1. Ron Bianchini (FORE & CMU)
2. Mauricio Breternitz (Moto)
3. Trung Diep (Intel)
4. F. Joel Ferguson (UCSC)
5. Andrew Huang (Moto)
6. Mikko Lipasti (IBM & UW)
7. Chris Newburn (Intel)
8. Derek Noonburg (S3)
9. Scott Robinson (Intel)
10. Mike Schuette (Moto)
11. Kent Wilken (UC-Davis)
12. Andy Wolfe (S3)

Microprocessor Performance
Unmatched by Any Other Industry! [John Crawford, Intel, 1993]

Doubling every 1 months (19-199): total of X
- Cars travel at, MPH; get 1, miles/gal.
- Air travel: L.A. to N.Y. in seconds (MACH )
- Wheat yield:, bushels per acre

Doubling every months ( ): total of 9,X
- Cars travel at, MPH; get 15, miles/gal.
- Air travel: L.A. to N.Y. in seconds (MACH 9,)
- Wheat yield: 9, bushels per acre

Leveraging the Treasure Chest
All originally invented in the 19 s
- Pipelining
- Cache Memories
- Multiple Instruction Issue
- Out of Order Execution
- Dataflow Machines
- Vector Machines
- Virtual Memory
- Optimizing Compilers
- Operating Systems

Iron Law of Processor Performance

Processor Performance = Time / Program
  = (Instructions / Program) x (Cycles / Instruction) x (Time / Cycle)

- Instructions / Program (code size): Architecture, Compiler Designer
- Cycles / Instruction (CPI): Implementation, Processor Designer
- Time / Cycle (cycle time): Realization, Chip Designer

Architecture --> Implementation --> Realization

Evolution of Microprocessors by 9
- Transistor Count: 1K-1K, 1K-1M, 1M-3M, 1,M
- Clock Frequency: .-MHz, -MHz, -MHz, 1GHz
- Instruction/cycle: << (?)
- MIPS/MFLOPS: << , 1,
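
As a small worked example of the Iron Law above, the sketch below multiplies the three terms and compares two design points. The function name and every number in it are hypothetical, chosen only for illustration; none of them come from the talk.

```python
def execution_time(instructions, cpi, cycle_time_s):
    """Time/Program = Instructions/Program x Cycles/Instruction x Time/Cycle."""
    return instructions * cpi * cycle_time_s


# Hypothetical design points (not from the slides).
baseline = execution_time(instructions=1_000_000, cpi=2.0, cycle_time_s=1e-9)
lower_cpi = execution_time(instructions=1_000_000, cpi=1.0, cycle_time_s=1e-9)

# Halving CPI with code size and cycle time unchanged halves execution time.
speedup = baseline / lower_cpi
print(f"speedup from halving CPI: {speedup:.2f}x")
```

The same function also shows why the three terms cannot be tuned in isolation: a change that lowers CPI but lengthens the cycle time by the same factor leaves the product unchanged.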

Strong Diminishing Returns on
[Figure: four panels (m88ksim, perl, ijpeg, li); x-axis: Issue Width]

Looking for A Paradigm Shift
Revisit previously-assumed limits and try to go beyond these limits.
- 1970s - Flynn's Bottleneck... Branch Prediction
- 1990s - Dataflow Limit... Value Prediction

Once Upon a Time... Fall 1995
Load Value Locality
[Figure: value locality (%) by benchmark, with Alpha AXP and PowerPC panels and one bar series per history depth]

Concept of Value Locality
[Figure: load value locality (%) and register value locality (%) by benchmark, one bar series per history depth]
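
One way to read the value locality charts above is as a hit rate: the fraction of dynamic loads whose value matches one of the last few distinct values produced by the same static load. The sketch below measures that under stated assumptions: the trace format (a list of `(pc, value)` pairs), the function name, and the move-to-front history policy are all hypothetical and are not taken from the talk.

```python
from collections import defaultdict, deque


def value_locality(trace, history_depth=1):
    """Percent of loads whose value is among the last `history_depth`
    distinct values seen by the same static load (keyed by its PC)."""
    history = defaultdict(lambda: deque(maxlen=history_depth))
    hits = 0
    for pc, value in trace:
        seen = history[pc]
        if value in seen:
            hits += 1
            seen.remove(value)  # re-inserted below as the most recent value
        seen.append(value)      # a full deque drops its oldest value
    return 100.0 * hits / len(trace) if trace else 0.0


# Hypothetical trace: one static load at PC 0x10 returning 5, 5, 7, 5.
trace = [(0x10, 5), (0x10, 5), (0x10, 7), (0x10, 5)]
shallow = value_locality(trace, history_depth=1)
deep = value_locality(trace, history_depth=2)
```

With a deeper history the final load of `5` counts as a hit even though `7` intervened, which is the effect the two bar series in the charts contrast.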

Dynamic Pipeline Contraction
[Figure: pipeline timing diagrams with stages Rename, Op Read, Execute, Commit, shown for Branch Prediction and Value Prediction]

Superspeculation Techniques
[Figure: pipeline diagrams (FETCH, DECODE/DISPATCH, EXEC., ADDR., TLB, MEM., COMPL.) annotated with four techniques]
- Dependence Prediction
- Reg. Value Prediction
- Load Value Prediction
- Mem. Alias Prediction

SPECint95 Performance (1 wide)
[Figure: sustained performance by benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, HM); series: Baseline, TC, +GAg, +-Phase, +VP, +LVP, +AP; cache configurations: Perfect, K/Infinite, K/ ports]

SPECfp95 Performance (1 wide)
[Figure: sustained performance by benchmark (tomcatv, swim, mgrid, applu, apsi, fpppp, HM); series: Baseline, TC, +GAg, +-Phase, +VP, +LVP, +AP; cache configurations: Perfect, K/Infinite, K/ ports]

A Possible New Paradigm
[Figure: four panels (m88ksim, perl, ijpeg, li); x-axis: Issue Width]

Hybrid Value Predictors
[Figure: Prediction Rates for VP1 and VP across SPEC95 benchmarks (SPECint and SPECfp); stacked categories: Both Correct, Stride+ Unique, FCM Unique, Incorrect, Not Predicted]
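
The chart categories above (Both Correct, Stride Unique, FCM Unique) can be illustrated with two toy component predictors run side by side. This is a minimal sketch, not the VP1 or VP configurations from the talk: the class names, the unbounded dictionary tables, the FCM order of 2, and the absence of confidence counters or a selector are all simplifying assumptions.

```python
class StridePredictor:
    """Predicts last value + last observed stride, per static instruction."""

    def __init__(self):
        self.table = {}  # pc -> (last_value, stride)

    def predict(self, pc):
        if pc not in self.table:
            return None
        last, stride = self.table[pc]
        return last + stride

    def update(self, pc, value):
        last = self.table[pc][0] if pc in self.table else value
        self.table[pc] = (value, value - last)


class FCMPredictor:
    """Finite context method: predicts the value that last followed the
    current context (the most recent `order` values of this instruction)."""

    def __init__(self, order=2):
        self.order = order
        self.context = {}  # pc -> tuple of recent values
        self.follows = {}  # (pc, context) -> value that followed it

    def predict(self, pc):
        ctx = self.context.get(pc, ())
        if len(ctx) < self.order:
            return None
        return self.follows.get((pc, ctx))

    def update(self, pc, value):
        ctx = self.context.get(pc, ())
        if len(ctx) == self.order:
            self.follows[(pc, ctx)] = value
        self.context[pc] = (ctx + (value,))[-self.order:]


def classify(trace):
    """Count, per dynamic instruction, which component predicted correctly."""
    stride, fcm = StridePredictor(), FCMPredictor()
    counts = {"both": 0, "stride_only": 0, "fcm_only": 0, "neither": 0}
    for pc, value in trace:
        s_ok = stride.predict(pc) == value
        f_ok = fcm.predict(pc) == value
        if s_ok and f_ok:
            counts["both"] += 1
        elif s_ok:
            counts["stride_only"] += 1
        elif f_ok:
            counts["fcm_only"] += 1
        else:
            counts["neither"] += 1
        stride.update(pc, value)
        fcm.update(pc, value)
    return counts
```

A steadily incrementing sequence is caught only by the stride component, while a repeating non-arithmetic pattern such as 1, 5, 2, 1, 5, 2 is caught only by the FCM component once the pattern has been seen, which is the motivation for combining them.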

Prediction Usefulness
[Figure: two panels, "Usefulness Tracking Impact on" and "Usefulness Tracking Impact on Prediction Rate"; groups: All, SPECfp, SPECint; series: Without Tracking, With Tracking; prediction rate categories: Correct, Incorrect, Not Predicted]

Current Value Prediction Landscape

Increasing Sophistication
- Hybrid Predictors
- Adaptive Predictors

Increasing Efficiency
- Selective Prediction
- Register-based Prediction
- Compiler Assistance

Extensions Into Other Domains
- VLIW-based Value Prediction
- Dynamic Instruction Reuse
- Aggressive Partial Evaluation

Instruction Supply Problem
[Figure: processor block diagram. Instruction Flow: Branch Prediction, T-cache, I-cache, FETCH, Instruction Buffer, DECODE/. Register Data Flow: Integer, Floating-point, Media, Memory, Reorder Buffer, COMPLETE. Memory Data Flow: Store Buffer, D-cache]

Wide-Machine Instruction
Three Major Challenges:
1. Multiple-Branch Prediction
2. Multiple Groups
3. Alignment and Collapsing
[Figure: Branch Prediction, I-cache, FETCH, Instruction Buffer, DECODE/DISPATCH]

High-bandwidth Instruction
[Figure: two timelines (axis: Time), each passing through the Execution Core and Completion]

Enhanced Instruction Cache:
- Multi-branch prediction
- Multi-ported instruction cache fetch
- Instructions align & collapse
- Execution Core
- Branch predictor update

Trace Cache:
- Next trace prediction
- Trace cache fetch
- Execution Core
- Instructions align & collapse
- Trace construction
- Trace cache fill
- Trace predictor update

Conventional Trace Cache
[Figure: block diagram with Next Trace Pred., trace_id, Trace Cache, I-Cache, History Hash, Fill Unit, br. hist., Buffer, Execution Core, Completion]

The Block-Based Trace Cache
[Figure: block diagram with Trace Table, Rename Table, trace_id, History Hash, block_ids, pre-collapse, Fill Unit, br. hist., Block Cache, Final Collapse, Completion, Buffer, Execution Core, I-Cache]

Next Trace Prediction
[Figure: b_id0, b_id1, b_id2, b_id3 and branch history feed a Hash Function that produces the next trace_id (tag, index); each Trace Table entry holds v, tag, and block_ids 1...w; on a Hit, w pred. block_ids go to the block cache]
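
The next-trace prediction figure above can be sketched as a tagged table lookup. This is only an illustration under stated assumptions: the class and method names, the table size, the hash function, and the tuple-based entry layout are hypothetical; the figure names only the inputs (recent block ids and branch history), the tag/index split, the valid bit, and the w predicted block_ids.

```python
class TraceTable:
    """Direct-mapped, tagged table mapping a hashed history to w block_ids."""

    def __init__(self, index_bits=4, width=4):
        self.index_bits = index_bits
        self.width = width  # w: block_ids supplied per prediction
        # None models an invalid entry (v = 0); otherwise (tag, block_ids).
        self.entries = [None] * (1 << index_bits)

    def _trace_id(self, recent_block_ids, branch_history):
        # Hypothetical hash: fold the recent block ids into the branch history.
        h = branch_history
        for b in recent_block_ids:
            h = (h * 31 + b) & 0xFFFF
        index = h & ((1 << self.index_bits) - 1)
        tag = h >> self.index_bits
        return tag, index

    def predict(self, recent_block_ids, branch_history):
        """Return the predicted block_ids on a tag hit, else None."""
        tag, index = self._trace_id(recent_block_ids, branch_history)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None

    def update(self, recent_block_ids, branch_history, block_ids):
        """Record the block_ids that actually followed this history."""
        tag, index = self._trace_id(recent_block_ids, branch_history)
        self.entries[index] = (tag, tuple(block_ids[: self.width]))


table = TraceTable()
first_try = table.predict((1, 2, 3, 4), 0b101)   # cold table: no prediction
table.update((1, 2, 3, 4), 0b101, [7, 8, 9, 10, 11])
second_try = table.predict((1, 2, 3, 4), 0b101)  # hit: the w stored block_ids
```

On a hit the returned ids would be sent to the block cache; a miss or tag mismatch corresponds to the "no pred" case.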

Replicated Block Cache
[Figure: block_id (n-bit) decoder selecting among N = n word lines; instructions from the block fill unit (i1 ... ib) fill a direct mapped Block Cache holding b inst per block, replicated as copy-1 through copy-; outputs pass through Final Collapse to the Buffer]

Block Rename Table
[Figure: block fetch address split into Tag and Index; table of N= entries, each with v, tag, block_id; tag comparators (=)]
Block fetch address renamed to a block_id
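
The renaming step above, a block fetch address looked up by tag and index to yield a block_id, can be sketched as a small direct-mapped table. The class name, the table size, the modulo indexing, and the caller-supplied block_id are assumptions made for illustration; the slide does not say how block_ids are allocated or replaced.

```python
class BlockRenameTable:
    """Direct-mapped table: block fetch address -> block_id."""

    def __init__(self, n_entries=16):
        self.n = n_entries
        self.valid = [False] * n_entries
        self.tag = [0] * n_entries
        self.block_id = [0] * n_entries

    def _split(self, fetch_address):
        # Low-order part selects the entry; the rest is kept as the tag.
        return fetch_address // self.n, fetch_address % self.n

    def lookup(self, fetch_address):
        """Return the block_id on a valid tag match, else None."""
        tag, index = self._split(fetch_address)
        if self.valid[index] and self.tag[index] == tag:
            return self.block_id[index]
        return None

    def install(self, fetch_address, block_id):
        """Map a fetch address to a block_id, replacing any prior entry."""
        tag, index = self._split(fetch_address)
        self.valid[index] = True
        self.tag[index] = tag
        self.block_id[index] = block_id


rename = BlockRenameTable(n_entries=16)
rename.install(0x40, block_id=3)
hit = rename.lookup(0x40)    # same tag and index: renamed to block_id 3
alias = rename.lookup(0x50)  # same index, different tag: no mapping
```

Downstream structures can then refer to a block by its short block_id instead of its full fetch address.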

Block Cache Fragmentation
[Figure: Avg. # of inst. in fetch buffer by benchmark (comp, gcc, go, ijpeg, li, mk, perl, vort, h-m); series: Block size (b), Block fetch width (w)]

Block Cache Replication
[Figure: Avg. # of inst. in fetch buffer by benchmark (comp, gcc, go, ijpeg, li, mk, perl, vort) for (b=, N=9); one series per block fetch width w]

Trace Table Size
[Figure: Hit rate (%) by benchmark (comp, gcc, go, ijpeg, li, mk, perl, vort); one series per trace table size]

Block Cache Hit Rate
[Figure: Number of blocks fetched by benchmark (comp, gcc, go, ijpeg, li, mk, perl, vort), as percentages split into hit, miss, no pred]

Performance vs. Block Cache Size
[Figure: four panels (compress, gcc, go, ijpeg); x-axis: Block cache entries (N)]

Performance vs. Block Cache Size
[Figure: four panels (li, m88ksim, perl, vortex); x-axis: Block cache entries (N)]

Aggregate vs. Block Cache Size
[Figure: harmonic mean vs. Block cache entries (N); curves: perfect fetch, perfect branch predict, baseline branch predict, perfect block predict, real block predict; annotated losses: data dependencies/instruction window limit, mispredictions, block cache capacity misses & fragmentation, taken branch boundary, branch misprediction]

Conventional vs. Block-Based
[Figure: four panels (compress, gcc, go, ijpeg); series: block based, conventional; x-axis: bytes]

Conventional vs. Block-Based
[Figure: four panels (li, m88ksim, perl, vortex); series: block based, conventional; x-axis: bytes]

Aggregate Comparison
[Figure: harmonic mean; series: block based, conventional; x-axis: bytes]
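
The aggregate charts above summarize the per-benchmark results with a harmonic mean. A minimal sketch of that aggregation follows; the function name and the sample rates are hypothetical and are not results from the talk.

```python
def harmonic_mean(rates):
    """Harmonic mean of positive per-benchmark rates."""
    rates = list(rates)
    if not rates or any(r <= 0 for r in rates):
        raise ValueError("harmonic mean needs at least one positive value")
    return len(rates) / sum(1.0 / r for r in rates)


# Hypothetical per-benchmark rates, for illustration only.
sample = [1.0, 2.0, 4.0]
aggregate = harmonic_mean(sample)
arithmetic = sum(sample) / len(sample)
```

For unequal rates the harmonic mean is lower than the arithmetic mean, so one slow benchmark pulls the aggregate down more than one fast benchmark pulls it up.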

Where Do We Go From Here...
New Forcing Functions:
- High Frequency Instruction Level Parallelism
  Simultaneously optimize and MHz
- Silicon and Time Efficient Performance (STEP)
  Increase both /die_area and δ()/δ(year)
- Dynamic/Static and Hardware/Software fusion
  Reconcile generational latencies of H/W and S/W
  Bridge compiled object code & machine executable

How to have your cake and eat it too
[Figure: SPECint95 average as a function of Pipe Depth and Pipe Width, with MHz annotations]

Journey of Many Small Steps
[Figure: Improvement across major microarchitecture generations]
Need to Increase δ()/δ(year)

Get the Best of Both Worlds
- Microarchitecture generational latency: -5 years
- Compiler generational latency: -1 years (?)
[Figure: Compiler, Object Code, Code Transformation, Executable Image, Execution Core]
High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs
High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs October 29, 2002 Microprocessor Research Forum Intel s Microarchitecture Research Labs! USA:
More informationComputer Science 146. Computer Architecture
Computer Architecture Spring 2004 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 9: Limits of ILP, Case Studies Lecture Outline Speculative Execution Implementing Precise Interrupts
More informationLecture 5: VLIW, Software Pipelining, and Limits to ILP. Review: Tomasulo
Lecture 5: VLIW, Software Pipelining, and Limits to ILP Professor David A. Patterson Computer Science 252 Spring 1998 DAP.F96 1 Review: Tomasulo Prevents Register as bottleneck Avoids WAR, WAW hazards
More informationJohn P. Shen Microprocessor Research Intel Labs March 19, 2002
&6 0LFURDUFKLWHFWXUH 6XSHUVFDODU 3URFHVVRU'HVLJQ John P. Shen Microprocessor Research Intel Labs March 19, 2002 (john.shen@intel.com) 0RRUH V /DZ&RQWLQXHV«Transistors (MT) 10,000 1,000 100 10 1 0.1 0.01
More informationInherently Lower Complexity Architectures using Dynamic Optimization. Michael Gschwind Erik Altman
Inherently Lower Complexity Architectures using Dynamic Optimization Michael Gschwind Erik Altman ÿþýüûúùúüø öõôóüòñõñ ðïîüíñóöñð What is the Problem? Out of order superscalars achieve high performance....butatthecostofhighhigh
More informationLecture 5: VLIW, Software Pipelining, and Limits to ILP Professor David A. Patterson Computer Science 252 Spring 1998
Lecture 5: VLIW, Software Pipelining, and Limits to ILP Professor David A. Patterson Computer Science 252 Spring 1998 DAP.F96 1 Review: Tomasulo Prevents Register as bottleneck Avoids WAR, WAW hazards
More informationEvaluation of a High Performance Code Compression Method
Evaluation of a High Performance Code Compression Method Charles Lefurgy, Eva Piccininni, and Trevor Mudge Advanced Computer Architecture Laboratory Electrical Engineering and Computer Science Dept. The
More informationMultithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others
Multithreading Processors and Static Optimization Review Adapted from Bhuyan, Patterson, Eggers, probably others Schedule of things to do By Wednesday the 9 th at 9pm Please send a milestone report (as
More informationCS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism
CS 252 Graduate Computer Architecture Lecture 4: Instruction-Level Parallelism Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://wwweecsberkeleyedu/~krste
More informationChapter-5 Memory Hierarchy Design
Chapter-5 Memory Hierarchy Design Unlimited amount of fast memory - Economical solution is memory hierarchy - Locality - Cost performance Principle of locality - most programs do not access all code or
More informationECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti
ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith Pipelining to Superscalar Forecast Real
More informationTDT 4260 lecture 7 spring semester 2015
1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationDynamic Control Hazard Avoidance
Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>
More informationCISC 662 Graduate Computer Architecture Lecture 18 - Cache Performance. Why More on Memory Hierarchy?
CISC 662 Graduate Computer Architecture Lecture 18 - Cache Performance Michela Taufer Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer Architecture, 4th edition ---- Additional
More informationOverview of Today s Lecture: Cost & Price, Performance { 1+ Administrative Matters Finish Lecture1 Cost and Price Add/Drop - See me after class
Overview of Today s Lecture: Cost & Price, Performance EE176-SJSU Computer Architecture and Organization Lecture 2 Administrative Matters Finish Lecture1 Cost and Price Add/Drop - See me after class EE176
More informationPowerPC 620 Case Study
Chapter 6: The PowerPC 60 Modern Processor Design: Fundamentals of Superscalar Processors PowerPC 60 Case Study First-generation out-of-order processor Developed as part of Apple-IBM-Motorola alliance
More informationNOW Handout Page # Why More on Memory Hierarchy? CISC 662 Graduate Computer Architecture Lecture 18 - Cache Performance
CISC 66 Graduate Computer Architecture Lecture 8 - Cache Performance Michela Taufer Performance Why More on Memory Hierarchy?,,, Processor Memory Processor-Memory Performance Gap Growing Powerpoint Lecture
More informationGetting CPI under 1: Outline
CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more
More informationEE382A Lecture 3: Superscalar and Out-of-order Processor Basics
EE382A Lecture 3: Superscalar and Out-of-order Processor Basics Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 3-1 Announcements HW1 is due today Hand
More informationAdministrivia. CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) Control Dependencies
Administrivia CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) HW #3, on memory hierarchy, due Tuesday Continue reading Chapter 3 of H&P Alan Sussman als@cs.umd.edu
More informationComputer Performance Evaluation: Cycles Per Instruction (CPI)
Computer Performance Evaluation: Cycles Per Instruction (CPI) Most computers run synchronously utilizing a CPU clock running at a constant clock rate: where: Clock rate = 1 / clock cycle A computer machine
More informationHardware Speculation Support
Hardware Speculation Support Conditional instructions Most common form is conditional move BNEZ R1, L ;if MOV R2, R3 ;then CMOVZ R2,R3, R1 L: ;else Other variants conditional loads and stores nullification
More informationWide Instruction Fetch
Wide Instruction Fetch Fall 2007 Prof. Thomas Wenisch http://www.eecs.umich.edu/courses/eecs470 edu/courses/eecs470 block_ids Trace Table pre-collapse trace_id History Br. Hash hist. Rename Fill Table
More informationComplex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units
6823, L14--1 Complex Pipelining: Out-of-order Execution & Register Renaming Laboratory for Computer Science MIT http://wwwcsglcsmitedu/6823 Multiple Function Units 6823, L14--2 ALU Mem IF ID Issue WB Fadd
More informationReview Tomasulo. Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue. Tomasulo Algorithm and Branch Prediction
CS252 Graduate Computer Architecture Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue March 23, 01 Prof. David A. Patterson Computer Science 252 Spring 01 Review Tomasulo Reservations
More informationPipelining to Superscalar
Pipelining to Superscalar ECE/CS 752 Fall 207 Prof. Mikko H. Lipasti University of Wisconsin-Madison Pipelining to Superscalar Forecast Limits of pipelining The case for superscalar Instruction-level parallel
More informationCMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More informationDynamic Memory Scheduling
Dynamic : Computer Architecture Dynamic Scheduling Overview Lecture Overview The Problem Conventional Solutions In-order Load/Store Scheduling Blind Dependence Scheduling Conservative Dataflow Scheduling
More informationCS252 S05. Outline. Dynamic Branch Prediction. Static Branch Prediction. Dynamic Branch Prediction. Dynamic Branch Prediction
Outline CMSC Computer Systems Architecture Lecture 9 Instruction Level Parallelism (Static & Dynamic Branch ion) ILP Compiler techniques to increase ILP Loop Unrolling Static Branch ion Dynamic Branch
More informationStatistical Simulation of Superscalar Architectures using Commercial Workloads
Statistical Simulation of Superscalar Architectures using Commercial Workloads Lieven Eeckhout and Koen De Bosschere Dept. of Electronics and Information Systems (ELIS) Ghent University, Belgium CAECW
More informationPage 1. Today s Big Idea. Lecture 18: Branch Prediction + analysis resources => ILP
CS252 Graduate Computer Architecture Lecture 18: Branch Prediction + analysis resources => ILP April 2, 2 Prof. David E. Culler Computer Science 252 Spring 2 Today s Big Idea Reactive: past actions cause
More informationAnnotated Memory References: A Mechanism for Informed Cache Management
Annotated Memory References: A Mechanism for Informed Cache Management Alvin R. Lebeck, David R. Raymond, Chia-Lin Yang Mithuna S. Thottethodi Department of Computer Science, Duke University http://www.cs.duke.edu/ari/ice
More information5008: Computer Architecture
5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage
More informationCISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP
CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationInstruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov
Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated
More informationLecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996
Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.SP96 1 Review: Evaluating Branch Alternatives Two part solution: Determine
More informationLecture 19: Memory Hierarchy Five Ways to Reduce Miss Penalty (Second Level Cache) Admin
Lecture 19: Memory Hierarchy Five Ways to Reduce Miss Penalty (Second Level Cache) Professor Alvin R. Lebeck Computer Science 220 Fall 1999 Exam Average 76 90-100 4 80-89 3 70-79 3 60-69 5 < 60 1 Admin
More informationEECS 322 Computer Architecture Superpipline and the Cache
EECS 322 Computer Architecture Superpipline and the Cache Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses powerpoint animation: please viewshow Summary:
More informationThe Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA
The Alpha 21264 Microprocessor: Out-of-Order ution at 600 Mhz R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA 1 Some Highlights z Continued Alpha performance leadership y 600 Mhz operation in
More informationUG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects
Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer
More informationComputing Along the Critical Path
Computing Along the Critical Path Dean M. Tullsen Brad Calder Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92093-0114 UCSD Technical Report October 1998
More informationCS 152 Computer Architecture and Engineering
CS 152 Computer Architecture and Engineering Lecture 19 Advanced Processors III 2006-11-2 John Lazzaro (www.cs.berkeley.edu/~lazzaro) TAs: Udam Saini and Jue Sun www-inst.eecs.berkeley.edu/~cs152/ 1 Last
More informationLecture 7 Instruction Level Parallelism (5) EEC 171 Parallel Architectures John Owens UC Davis
Lecture 7 Instruction Level Parallelism (5) EEC 171 Parallel Architectures John Owens UC Davis Credits John Owens / UC Davis 2007 2009. Thanks to many sources for slide material: Computer Organization
More informationInstruction Level Parallelism. Taken from
Instruction Level Parallelism Taken from http://www.cs.utsa.edu/~dj/cs3853/lecture5.ppt Outline ILP Compiler techniques to increase ILP Loop Unrolling Static Branch Prediction Dynamic Branch Prediction
More informationImproving Cache Performance. Dr. Yitzhak Birk Electrical Engineering Department, Technion
Improving Cache Performance Dr. Yitzhak Birk Electrical Engineering Department, Technion 1 Cache Performance CPU time = (CPU execution clock cycles + Memory stall clock cycles) x clock cycle time Memory
More informationThe Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights
The Alpha 21264 Microprocessor: Out-of-Order ution at 600 MHz R. E. Kessler Compaq Computer Corporation Shrewsbury, MA 1 Some Highlights Continued Alpha performance leadership 600 MHz operation in 0.35u
More informationOutline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??
Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross
More informationE0-243: Computer Architecture
E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation
More informationThe Von Neumann Computer Model
The Von Neumann Computer Model Partitioning of the computing engine into components: Central Processing Unit (CPU): Control Unit (instruction decode, sequencing of operations), Datapath (registers, arithmetic
More informationSuper Scalar. Kalyan Basu March 21,
Super Scalar Kalyan Basu basu@cse.uta.edu March 21, 2007 1 Super scalar Pipelines A pipeline that can complete more than 1 instruction per cycle is called a super scalar pipeline. We know how to build
More informationIn its brief lifetime of 26 years, the microprocessor
Theme Feature Superspeculative Microarchitecture for Beyond AD 2000 Employing a broad spectrum of superspeculative techniques can achieve significant performance increases over today s top-of-the-line
More informationECE/CS 552: Introduction to Superscalar Processors
ECE/CS 552: Introduction to Superscalar Processors Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith Limitations of Scalar Pipelines
More informationAppendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Appendix C Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure C.2 The pipeline can be thought of as a series of data paths shifted in time. This shows
More informationLecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S
Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest
More informationCalibration of Microprocessor Performance Models
Calibration of Microprocessor Performance Models Bryan Black and John Paul Shen Department of Electrical and Computer Engineering Carnegie Mellon University Pittsburgh PA, 15213 {black,shen}@ece.cmu.edu
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationCS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III
CS 152 Computer Architecture and Engineering Lecture 8 - Memory Hierarchy-III Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste
More informationInstruction Level Parallelism
Instruction Level Parallelism Software View of Computer Architecture COMP2 Godfrey van der Linden 200-0-0 Introduction Definition of Instruction Level Parallelism(ILP) Pipelining Hazards & Solutions Dynamic
More informationSpeculative Execution via Address Prediction and Data Prefetching