PowerPC 620 Case Study

Similar documents
Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

CS 152 Computer Architecture and Engineering

Case Study IBM PowerPC 620

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

Advanced processor designs

Hardware Speculation Support

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Limitations of Scalar Pipelines

Superscalar Processors Ch 14

Superscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?

Instruction Level Parallelism

Instruction Fetch and Branch Prediction. CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3.

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov

Hardware-Based Speculation

Metodologie di Progettazione Hardware-Software

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Lecture-13 (ROB and Multi-threading) CS422-Spring

Superscalar Processors Ch 13. Superscalar Processing (5) Computer Organization II 10/10/2001. New dependency for superscalar case? (8) Name dependency

Data-flow prescheduling for large instruction windows in out-of-order processors. Pierre Michaud, André Seznec IRISA / INRIA January 2001

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CS 152 Computer Architecture and Engineering

Dynamic Hardware Prediction. Basic Branch Prediction Buffers. N-bit Branch Prediction Buffers

Processor (IV) - advanced ILP. Hwansoo Han

Advanced Computer Architecture

Handout 2 ILP: Part B

Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ...

5008: Computer Architecture

Architectures for Instruction-Level Parallelism

Computer Architecture: Out-of-Order Execution II. Prof. Onur Mutlu Carnegie Mellon University

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

CS / ECE 6810 Midterm Exam - Oct 21st 2008

The Processor: Instruction-Level Parallelism

Review Tomasulo. Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue. Tomasulo Algorithm and Branch Prediction

E0-243: Computer Architecture

November 7, 2014 Prediction

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

Topics. Digital Systems Architecture EECE EECE Predication, Prediction, and Speculation

ECE/CS 552: Introduction to Superscalar Processors

SUPERSCALAR AND VLIW PROCESSORS

CS 152, Spring 2011 Section 8

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )

Dynamic Memory Dependence Predication

Multiple Instruction Issue and Hardware Based Speculation

Instruction Level Parallelism (ILP)

Instruction Level Parallelism

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

ECE 552: Introduction To Computer Architecture 1. Scalar upper bound on throughput. Instructor: Mikko H Lipasti. University of Wisconsin-Madison

CS146 Computer Architecture. Fall Midterm Exam

PowerPC 740 and 750

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

Announcements. ECE4750/CS4420 Computer Architecture L10: Branch Prediction. Edward Suh Computer Systems Laboratory

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

Instruction-Level Parallelism and Its Exploitation

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions

CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)

Lecture 5: VLIW, Software Pipelining, and Limits to ILP. Review: Tomasulo

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Wide Instruction Fetch

Chapter-5 Memory Hierarchy Design

[1] C. Moura, \SuperDLX A Generic SuperScalar Simulator," ACAPS Technical Memo 64, School

Superscalar Processor Design

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy

Page 1. Today s Big Idea. Lecture 18: Branch Prediction + analysis resources => ILP

Multithreaded Processors. Department of Electrical Engineering Stanford University

EC 513 Computer Architecture

Administrivia. CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) Control Dependencies

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

EECS 470 PROJECT: P6 MICROARCHITECTURE BASED CORE

Lecture 7 Instruction Level Parallelism (5) EEC 171 Parallel Architectures John Owens UC Davis

CS252 Graduate Computer Architecture Lecture 6. Recall: Software Pipelining Example

Static Branch Prediction

Motivation. Banked Register File for SMT Processors. Distributed Architecture. Centralized Architecture

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Functional Units. Registers. The Big Picture: Where are We Now? The Five Classic Components of a Computer Processor Input Control Memory

CS425 Computer Systems Architecture

Computer Science 146. Computer Architecture

Superscalar Processors

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

Complex Pipelines and Branch Prediction

Techniques for Efficient Processing in Runahead Execution Engines

Getting CPI under 1: Outline

Dynamic Control Hazard Avoidance

ECE 341. Lecture # 15

Chapter 4 The Processor (Part 4)

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

Announcements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog)

Portland State University ECE 587/687. Superscalar Issue Logic

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

CS252 S05. Outline. Dynamic Branch Prediction. Static Branch Prediction. Dynamic Branch Prediction. Dynamic Branch Prediction

Itanium 2 Processor Microarchitecture Overview

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation

Transcription:

Chapter 6: The PowerPC 60 Modern Processor Design: Fundamentals of Superscalar Processors PowerPC 60 Case Study First-generation out-of-order processor Developed as part of Apple-IBM-Motorola alliance Aggressive goals, targets Interesting microarchitectural features Hopelessly delayed Led to future, successful designs IBM/Motorola/Apple Alliance Alliance begun in 99 with a joint design center (Somerset) in Austin Ambitious objective: unseat Intel on the desktop Delays, conflicts, politics hasn t happened, alliance largely dissolved today PowerPC 60 Quick design based on RSC compatible with POWER and PowerPC PowerPC 603 Low power implementation designed for small uniprocessor systems 5 FUs: branch, integer, system, load/store, FP PowerPC 604 4-wide machine 6 FUs, each with -entry RS PowerPC 60 First 64-bit machine, also 4-wide Same 6 FUs as 604 Next slide, also chapter 5 in the textbook PowerPC G3, G4 Newer derivatives of the PowerPC 603 (3-issue, in-order) Added Altivec multimedia extensions PowerPC 60 PowerPC 60 Joint IBM/Apple/Motorola design Aggressively out-of-order, weak memory order, 64 bits Hopelessly delayed, very few shipped, but influenced later designs

PowerPC 60 Pipeline PowerPC 60 Pipeline 4-wide, BTAC simple predictor Instruction Buffer Decouples fetch from dispatch stalls Holds up to 8 instructions Dispatch Stage Rename Allocate: rename buffer, completion buffer Dispatch to reservation station Branches: resolve (if operands avail.) or predict with BHT Reservation Stations to 4 entries per functional unit, depending on type RS holds instruction payload, operands PowerPC 60 Pipeline PowerPC 60 Pipeline Execute Stage Six functional units Execute, bypass to waiting RS entries, write rename buffer Completion Buffer Sixteen entries, holds instruction state until in-order completion Complete Stage Maintains precise exceptions by buffering out-of-order instructions 4-wide Writeback Stage In-order writeback: results copied from rename buffer to architected register file

Benchmark Performance Benchmarks Dynamic Instructions Execution Cycles IPC compress 6,884,47 6,06,494.4 Eqntott 3,47,33,88,33.44 espresso 4,65,085 3,4,653.35 Li 3,376,45 3,399,93 0.99 alvinn 4,86,38,744,098.77 hydrod 4,4,60 4,93,30 0.96 tomcatv 6,858,69 6,494,9.06 Branch Prediction Two-phase branch prediction BTAC Holds targets of taken branches only On miss, fetch sequential (not-taken) path Accessed in single cycle in fetch stage Generates fetch address for next cycle 56 entries, -way set-associative BHT Accessed in dispatch stage 048 entries of -bit counters (bimodal) Also attempts to resolve branches at dispatch Interactions {BTAC right, wrong} x {BHT right, wrong} = 4 cases BHT overrides BTAC Branch Prediction Accuracy compress eqntott espresso li alvinn hydrod tomcatv BranchResolution Not Taken 40.35% 3.84% 40.05% 33.09% 6.38% 7.5% 6.% Taken 59.65% 68.6% 59.95% 66.9% 93.6% 8.49% 93.88% BTACPrediction Correct 84. 8.64% 8.99% 74.7 94.49% 88.3% 93.3% Incorrect 5.9 7.36% 8.0% 5.3 5.5%.69% 6.69% BHT Prediction Resolved 9.7% 8.3 7.09% 8.83% 7.49% 6.8% 45.39% Correct 68.86% 7.6% 7.7% 6.45% 8.58% 68.0 5.56% Incorrect.43% 9.54% 0.64% 8.7% 0.9% 5.8%.05% BTAC Incorrect and BHT Correct 0.0% 0.79%.3% 7.78% 0.07% 0.9% 0.0 BTAC Correct and BHT Incorrect 0.0 0.% 0.37% 0.6% 0.0 0.08% 0.0 Overall Branch Prediction Accuracy 88.57% 90.46% 89.36% 9.8% 99.07% 94.8% 97.95% Wasted Fetch Cycles Benchmark Misprediction I-Cache Miss compress 6.65% 0.0% eqntott.78% 0.08% espresso 0.84% 0.5% li 8.9% 0.09% alvinn 0.39% 0.0% hydrod 5.4% 0.% tomcatv 0.68% 0.0% 3

li li (a) Instruction (b) Completion Buffer Utilization eqntott compress 6 Dispatch Stalls Instruction buffer Decouples fetch/dispatch Completion buffer Supports OOO execution espresso tomcatv hydrod alvinn 6 6 6 6 6 6 Frequency of dispatch stall cycles. Sources of Dispatch Stalls compress eqntott espresso li alvinn hydrod tomcatv Serialization0.0 0.0 0.0 0.0 0.0 0.0 0.0 Move to special register constraint 0.0 4.49% 0.94% 3.44% 0.0 0.95% 0.08% Read port saturation 0.6% 0.0 0.0% 0.0 0.3%.3% 6.73% Reservation station saturation 36.07%.36% 3.5 34.4.8% 4.7 36.5% Rename buffer saturation 4.06% 7.6 3.93% 7.6%.36% 6.98% 34.3% Completion buffer saturation 5.54% 3.64%.0% 4.7%.% 7.8 9.03% Another to same unit 9.7% 0.5% 8.3% 0.57% 4.3.0% 7.7% No dispatch stalls 4.35% 4.4 33.8% 30.06% 30.09% 7.33% 6.35% Issue Stalls Frequency of issue stall cycles. Sources of Issue Stalls compress eqntott espresso li alvinn hydrod tomcatv Out of order disallowed 0.0 0.0 0.0 0.0 0.7%.03%.53% Serialization.69%.8% 3.% 0.8% 0.03% 4.47% 0.0% Waiting for source.97% 9.3 37.79% 3.03% 7.74%.7% 3.5% Waiting for execution unit 3.67% 3.8% 7.06%.0%.8%.5.3 No issue stalls 6.67% 65.6% 5.94% 46.5% 78.7 60.9% 93.64% Parallelism Achieved compress eqntott espresso alvinn hydrod 3 3 3 3 3 3 (a) Dispatching (b) Issuing (c) Finishing (d) Completion 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 tomcatv 3 3 3 3 4

Summary of PowerPC 60 First-generation OOO part Aggressive goals, poor execution Interesting contributions Two-phase branch prediction (also in 604) Short pipeline Weak ordering of memory references PowerPC evolution 998: Power3 (630FP) 00: Power4 004: Power5 60 vs. Power3 vs. Power4 Attribute Frequency Pipeline depth Branch predictor Fetch/issue/comple tion width Rename/physical registers In-flight instructions FP Units Load/store units Instruction Cache Data Cache L/L3 size L bandwidth Store queue entries MSHRs Hardware prefetch 60 7 MHz 5+ Bimodal BHT + BTAC 4/6/4 8 Int, 8 FP 6 3K 8w SA 3K 8w SA 4M GB/s 6 x 8B I:/D: None Power3 450 MHz 5+ Same 4/8/4 6 Int, 4 FP 3 3K 8w SA 64K 8w SA 6M 6.4GB/s 6 x 8B I:/D:4 4 streams Power4.3 GHz 5+ 3x6 b combining 4/8/5 80 Int, 7 FP Up to 00 64K DM 3K w SA.5M/3M 00+ GB/s x 64B I:/D:8 8 streams IBM Power4 IBM POWER4, began shipping in 00 Deep pipeline: 5 stages minimum Aggressive combining branch prediction Over 00 instructions in flight, tracked in 0 groups of 5 in ROB Aggressive memory hierarchy, memory bandwidth 5