Chapter 6: The PowerPC 60 Modern Processor Design: Fundamentals of Superscalar Processors PowerPC 60 Case Study First-generation out-of-order processor Developed as part of Apple-IBM-Motorola alliance Aggressive goals, targets Interesting microarchitectural features Hopelessly delayed Led to future, successful designs IBM/Motorola/Apple Alliance Alliance begun in 99 with a joint design center (Somerset) in Austin Ambitious objective: unseat Intel on the desktop Delays, conflicts, politics hasn t happened, alliance largely dissolved today PowerPC 60 Quick design based on RSC compatible with POWER and PowerPC PowerPC 603 Low power implementation designed for small uniprocessor systems 5 FUs: branch, integer, system, load/store, FP PowerPC 604 4-wide machine 6 FUs, each with -entry RS PowerPC 60 First 64-bit machine, also 4-wide Same 6 FUs as 604 Next slide, also chapter 5 in the textbook PowerPC G3, G4 Newer derivatives of the PowerPC 603 (3-issue, in-order) Added Altivec multimedia extensions PowerPC 60 PowerPC 60 Joint IBM/Apple/Motorola design Aggressively out-of-order, weak memory order, 64 bits Hopelessly delayed, very few shipped, but influenced later designs
PowerPC 60 Pipeline PowerPC 60 Pipeline 4-wide, BTAC simple predictor Instruction Buffer Decouples fetch from dispatch stalls Holds up to 8 instructions Dispatch Stage Rename Allocate: rename buffer, completion buffer Dispatch to reservation station Branches: resolve (if operands avail.) or predict with BHT Reservation Stations to 4 entries per functional unit, depending on type RS holds instruction payload, operands PowerPC 60 Pipeline PowerPC 60 Pipeline Execute Stage Six functional units Execute, bypass to waiting RS entries, write rename buffer Completion Buffer Sixteen entries, holds instruction state until in-order completion Complete Stage Maintains precise exceptions by buffering out-of-order instructions 4-wide Writeback Stage In-order writeback: results copied from rename buffer to architected register file
Benchmark Performance Benchmarks Dynamic Instructions Execution Cycles IPC compress 6,884,47 6,06,494.4 Eqntott 3,47,33,88,33.44 espresso 4,65,085 3,4,653.35 Li 3,376,45 3,399,93 0.99 alvinn 4,86,38,744,098.77 hydrod 4,4,60 4,93,30 0.96 tomcatv 6,858,69 6,494,9.06 Branch Prediction Two-phase branch prediction BTAC Holds targets of taken branches only On miss, fetch sequential (not-taken) path Accessed in single cycle in fetch stage Generates fetch address for next cycle 56 entries, -way set-associative BHT Accessed in dispatch stage 048 entries of -bit counters (bimodal) Also attempts to resolve branches at dispatch Interactions {BTAC right, wrong} x {BHT right, wrong} = 4 cases BHT overrides BTAC Branch Prediction Accuracy compress eqntott espresso li alvinn hydrod tomcatv BranchResolution Not Taken 40.35% 3.84% 40.05% 33.09% 6.38% 7.5% 6.% Taken 59.65% 68.6% 59.95% 66.9% 93.6% 8.49% 93.88% BTACPrediction Correct 84. 8.64% 8.99% 74.7 94.49% 88.3% 93.3% Incorrect 5.9 7.36% 8.0% 5.3 5.5%.69% 6.69% BHT Prediction Resolved 9.7% 8.3 7.09% 8.83% 7.49% 6.8% 45.39% Correct 68.86% 7.6% 7.7% 6.45% 8.58% 68.0 5.56% Incorrect.43% 9.54% 0.64% 8.7% 0.9% 5.8%.05% BTAC Incorrect and BHT Correct 0.0% 0.79%.3% 7.78% 0.07% 0.9% 0.0 BTAC Correct and BHT Incorrect 0.0 0.% 0.37% 0.6% 0.0 0.08% 0.0 Overall Branch Prediction Accuracy 88.57% 90.46% 89.36% 9.8% 99.07% 94.8% 97.95% Wasted Fetch Cycles Benchmark Misprediction I-Cache Miss compress 6.65% 0.0% eqntott.78% 0.08% espresso 0.84% 0.5% li 8.9% 0.09% alvinn 0.39% 0.0% hydrod 5.4% 0.% tomcatv 0.68% 0.0% 3
li li (a) Instruction (b) Completion Buffer Utilization eqntott compress 6 Dispatch Stalls Instruction buffer Decouples fetch/dispatch Completion buffer Supports OOO execution espresso tomcatv hydrod alvinn 6 6 6 6 6 6 Frequency of dispatch stall cycles. Sources of Dispatch Stalls compress eqntott espresso li alvinn hydrod tomcatv Serialization0.0 0.0 0.0 0.0 0.0 0.0 0.0 Move to special register constraint 0.0 4.49% 0.94% 3.44% 0.0 0.95% 0.08% Read port saturation 0.6% 0.0 0.0% 0.0 0.3%.3% 6.73% Reservation station saturation 36.07%.36% 3.5 34.4.8% 4.7 36.5% Rename buffer saturation 4.06% 7.6 3.93% 7.6%.36% 6.98% 34.3% Completion buffer saturation 5.54% 3.64%.0% 4.7%.% 7.8 9.03% Another to same unit 9.7% 0.5% 8.3% 0.57% 4.3.0% 7.7% No dispatch stalls 4.35% 4.4 33.8% 30.06% 30.09% 7.33% 6.35% Issue Stalls Frequency of issue stall cycles. Sources of Issue Stalls compress eqntott espresso li alvinn hydrod tomcatv Out of order disallowed 0.0 0.0 0.0 0.0 0.7%.03%.53% Serialization.69%.8% 3.% 0.8% 0.03% 4.47% 0.0% Waiting for source.97% 9.3 37.79% 3.03% 7.74%.7% 3.5% Waiting for execution unit 3.67% 3.8% 7.06%.0%.8%.5.3 No issue stalls 6.67% 65.6% 5.94% 46.5% 78.7 60.9% 93.64% Parallelism Achieved compress eqntott espresso alvinn hydrod 3 3 3 3 3 3 (a) Dispatching (b) Issuing (c) Finishing (d) Completion 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 tomcatv 3 3 3 3 4
Summary of PowerPC 60 First-generation OOO part Aggressive goals, poor execution Interesting contributions Two-phase branch prediction (also in 604) Short pipeline Weak ordering of memory references PowerPC evolution 998: Power3 (630FP) 00: Power4 004: Power5 60 vs. Power3 vs. Power4 Attribute Frequency Pipeline depth Branch predictor Fetch/issue/comple tion width Rename/physical registers In-flight instructions FP Units Load/store units Instruction Cache Data Cache L/L3 size L bandwidth Store queue entries MSHRs Hardware prefetch 60 7 MHz 5+ Bimodal BHT + BTAC 4/6/4 8 Int, 8 FP 6 3K 8w SA 3K 8w SA 4M GB/s 6 x 8B I:/D: None Power3 450 MHz 5+ Same 4/8/4 6 Int, 4 FP 3 3K 8w SA 64K 8w SA 6M 6.4GB/s 6 x 8B I:/D:4 4 streams Power4.3 GHz 5+ 3x6 b combining 4/8/5 80 Int, 7 FP Up to 00 64K DM 3K w SA.5M/3M 00+ GB/s x 64B I:/D:8 8 streams IBM Power4 IBM POWER4, began shipping in 00 Deep pipeline: 5 stages minimum Aggressive combining branch prediction Over 00 instructions in flight, tracked in 0 groups of 5 in ROB Aggressive memory hierarchy, memory bandwidth 5