Trace-driven performance exploration of a PowerPC 601 workload on wide superscalar processors

J. H. Moreno, M. Moudgill, J. D. Wellman, P. Bose, L. Trevillyan
IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598
Moreno et al., 01/26/98

Why yet another superscalar processor model and performance evaluation?

- Superscalar processors continue dominating the field
  - No apparent likelihood of ending superscalar paradigm in near future
  - Continuing improvements in features and capabilities
  - Certain aspects getting easier due to number of transistors available
  - Existing programs (binary compatibility)
- Need for evaluating new implementation challenges
  - High frequency objectives
  - New structures and algorithms
  - Wider instruction issue
- Need to understand impact of various classes of workloads
  - "Commercial" workloads...

The MET: Microarchitecture Exploration Toolset

- Collection of tools for exploration of microarchitecture features
  - Trace-driven and execution-driven tools
  - Fast simulation: >300 M inst/hour
- Intended to support early exploration of processor organizations
  - Detailed model of generalized pipeline
  - Trends among results instead of their magnitudes

Processor organization

[Figure: block diagram of the processor organization. Blocks shown: I-TLB, L1-I cache, NFA/branch predictor, I-fetch, I-buffer, I-prefetch, decode/expand, rename/dispatch, L2 cache, main memory; issue queues for integer, load/store, floating point, and branch, each with issue logic and register read; integer, load/store, floating-point, and branch units; L1-D cache, D-TLB, TLB2, load/store reorder buffer, store queue, miss queue, cast-out queue; retirement queue and retirement logic.]
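To give a rough sense of what the quoted simulation speed implies, the sketch below estimates wall-clock time for one pass over a trace. The rate of 300 M instructions per hour is the lower bound stated on the slide. The function name is hypothetical, and the example trace lengths (172 M and 1212 M instructions) are the ones listed with the trace characteristics later in the deck.

```python
def simulation_hours(trace_instructions: float, rate_per_hour: float = 300e6) -> float:
    """Upper bound on simulation time, given a lower bound on the simulation rate.

    rate_per_hour defaults to the ">300 M inst/hour" figure from the slide;
    the function itself is an illustrative helper, not part of the MET.
    """
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    return trace_instructions / rate_per_hour


# Trace lengths as reported in the deck's trace table.
for length in (172e6, 1212e6):
    hours = simulation_hours(length)
    print(f"{length / 1e6:.0f} M instructions: at most {hours:.2f} hours")
```

Because the slide gives the rate as a lower bound, the results are upper bounds on run time: roughly half an hour for the shorter trace and about four hours for the longer one.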

Pipeline stages

Integer:         Fetch, Decode, Expand, Rename, Dispatch, Issue, Read, Exec, WB, Retire
Load:            Fetch, Decode, Expand, Rename, Dispatch, Issue, Read, EA, Dcache access, WB, Retire
Floating point:  Fetch, Decode, Expand, Rename, Dispatch, Issue, Read, Exec1, Exec2, Exec3, WB, Retire

Traces

Length                                  172 M instructions,      1212 M instructions,
                                        user and kernel space    user space
Branch instructions                     18.9 %                   21.6 %
Branches taken                          44.3 %                   56.3 %
Instrs. in kernel space                 22.1 %                   n/a
Memory access instructions              34.8 %                   27.1 %
Load/store multiple instructions        1.6 %                    1.6 %
String instructions                     1.4 %                    0.3 %
Load/store w/update instrs.             1.7 %                    2.8 %
Average block size                      5.3 instrs.
Mispredicted instructions, addresses    No                       Yes
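The reported average block size can be related to the branch frequency in the same table. The sketch below assumes, as a simplification not stated in the deck, that every basic block ends in exactly one branch, so the average block size is the reciprocal of the branch fraction. The function name is hypothetical.

```python
def average_block_size(branch_fraction: float) -> float:
    """Average instructions per block, assuming one branch terminates each block.

    This is a simplifying assumption for illustration; the deck does not say
    how its block-size figure was measured.
    """
    if not 0.0 < branch_fraction <= 1.0:
        raise ValueError("branch_fraction must be in (0, 1]")
    return 1.0 / branch_fraction


# 18.9 % branch instructions, from the first column of the trace table.
print(round(average_block_size(0.189), 1))  # 5.3
```

Under this assumption, the 18.9 % branch frequency of the 172 M instruction trace reproduces the 5.3 instructions per block shown in the table.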

Adder for various processor configurations

[Figure: adder for various processor configurations. Series: Finite/non-perfect, Infinite/perfect. Panel label: gcc95.]

- Much larger adder in the case of the workload

Miss rates (per 1000 instructions): 64K L1, 2M L2, 128 entries TLB, 8K entries BHT

I1                                  21.3      8.3
I2                                  3.1       2
D1                                  22.4      9.8
D2                                  1.8       2
I TLB                               1.8       ~0
D TLB                               4.5       1.5
Conditional branch misprediction    5.3 %     7.0 %
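The miss rates above are expressed per 1000 instructions, so the L2 figures are not directly comparable to conventional per-access miss ratios. As a minimal sketch, and assuming that I2 and D2 count L2 misses caused by instruction and data references respectively (the slide does not define the row labels), the local L2 miss ratio can be estimated by dividing L2 misses by the L1 misses that reach the L2. The function name is hypothetical; the inputs are the first-column values.

```python
def local_miss_ratio(l2_misses_per_k: float, l1_misses_per_k: float) -> float:
    """Fraction of L1 misses that also miss in L2 (both given per 1000 instructions).

    Assumes every L1 miss generates exactly one L2 access, which ignores
    prefetches and cast-outs.
    """
    if l1_misses_per_k <= 0:
        raise ValueError("l1_misses_per_k must be positive")
    return l2_misses_per_k / l1_misses_per_k


# First-column values from the miss-rate table.
instruction_side = local_miss_ratio(3.1, 21.3)
data_side = local_miss_ratio(1.8, 22.4)
print(f"instruction side: {instruction_side:.1%}")  # 14.6%
print(f"data side: {data_side:.1%}")                # 8.0%
```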

Exploration space (in this presentation)

Issue policy    Width (Fetch/Dispatch/Retire)    Cache size (L1-I, L1-D, L2)    Branch prediction
Class-order     4/4/6                            64K, 64K, 2M                   8192 entry BHT, 4096 BTAC
Out-of-order    8/8/12                           128K, 128K, 4M                 Perfect
                12/12/16                         128K, 128K, Inf
                                                 Inf, Inf, Inf

Widths                   Units          Ports                 Queues               Physical registers
Fetch/Dispatch/Retire    FX/FP/LS/BR    Data cache and TLB    Issue/Retire/IBuf    GPR/FPR/CCR/SPR
4/4/6                    3/2/2/2        2                     20(12)/128/24        80/80/32/64
8/8/12                   6/4/4/4        4                     40/160/48            128/128/64/96
12/12/16                 8/4/6/4        6                     60/160/72            128/128/64/96

Other parameters (examples)

Sizes
  I-prefetch buffer (entries)               4
  Miss queue, cast-out queue (entries)      8
  Store queue, reorder buffer (entries)     31
  D/I-TLBs (entries)                        128
  TLB2 (entries)                            1024
  L1-I/D, L2-cache line size (bytes)        128
  Page size (bytes)                         4096
Latencies
  I-prefetch buffer latency (cycles)        1
  D/I-TLBs miss penalty (cycles)            4
  TLB2 miss penalty (cycles)                40
  L1-I/D cache miss penalty (cycles)        8. 7
  L2 cache miss penalty (cycles)            40
Branch prediction
  BTAC (entries)                            4096
  LR stack size (entries)                   32
  Branch history table (entries)            8192
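The exploration space is the cross product of the four dimensions listed at the top of this slide. The sketch below enumerates it and builds short labels in the style used on the result charts (for example `o12st`, `c4inf`). The dictionary names and the label format are illustrative. Pairing the chart abbreviations St, Lg, IL2, and Inf with the four cache-size rows in that order is an assumption, as is reading Bp as the BHT/BTAC predictor and Pf as the perfect one.

```python
from itertools import product

# Options taken from the exploration-space table; the short keys are assumptions.
ISSUE_POLICIES = {"c": "Class-order", "o": "Out-of-order"}
WIDTHS = {4: "4/4/6", 8: "8/8/12", 12: "12/12/16"}  # Fetch/Dispatch/Retire
CACHES = {
    "st": ("64K", "64K", "2M"),
    "lg": ("128K", "128K", "4M"),
    "il2": ("128K", "128K", "Inf"),
    "inf": ("Inf", "Inf", "Inf"),
}  # L1-I, L1-D, L2
PREDICTORS = {"bp": "8192 entry BHT, 4096 BTAC", "pf": "Perfect"}


def configuration_labels():
    """Yield one label per point in the exploration space, e.g. 'o12infpf'."""
    for policy, width, cache, predictor in product(
        ISSUE_POLICIES, WIDTHS, CACHES, PREDICTORS
    ):
        yield f"{policy}{width}{cache}{predictor}"


labels = list(configuration_labels())
print(len(labels))            # 48
print(labels[0], labels[-1])  # c4stbp o12infpf
```

With 2 issue policies, 3 widths, 4 cache configurations, and 2 branch predictors, the full space has 48 points.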

Adders due to issue policy (as % of base case)

[Figure: bar charts, two panels. Series: Class-order, Out-of-order. Bar labels combine width (4, 8, 12), cache configuration (Inf, IL2, Lg, St), and branch prediction (Pf, Bp).]

First panel, labeled values (%):

Cache, predictor    Width 4    Width 8    Width 12
Inf, Pf             35         99         125
IL2, Pf             21         45         53
Lg, Pf              21         40         47
St, Pf              18         34         38
Inf, Bp             21         60         70
IL2, Bp             16         35         40
Lg, Bp              16         33         37
St, Bp              15         29         32

Second panel, labeled values (%):

Cache, predictor    Width 4    Width 8    Width 12
Inf, Pf             74         152        209
St, Pf              73         127        154
Inf, Bp             30         51         59
St, Bp              30         48         53

Adders due to branch prediction (as % of base case)

[Figure: bar charts, two panels. Series: Imperfect, Perfect. Bar labels combine issue policy (c, o), width (4, 8, 12), and cache configuration (inf, il2, lg, st).]

First panel, labeled values (%):

Cache    c4    c8    c12    o4    o8     o12
inf      13    14    16     26    42     54
il2      15    18    21     21    27     33
lg       14    17    19     19    24     28
st       15    18    20     18    22     26

Second panel, labeled values (%):

Cache    c4    c8    c12    o4    o8     o12
inf      21    22    25     63    103    143
st       19    23    24     58    88     107

Adders due to cache size (as % of base case)

[Figure: bar charts, two panels. Series in first panel: St, Lg, IL2, Inf. Series in second panel: St, Inf. Bar labels combine issue policy (c, o), width (4, 8, 12), and branch prediction (pf, bp).]

First panel, labeled values (%):

Predictor    c4    c8    c12    o4    o8    o12
pf           29    31    31     44    79    92
bp           31    35    36     38    60    66

Adders due to processor width (as % of base case)

[Figure: bar charts, two panels. Series: w=4, w=8, w=12. Bar labels combine issue policy (c, o), cache configuration (inf, il2, lg, st), and branch prediction (pf, bp).]

First panel, labeled values (%):

Cache, predictor    c     o
inf, pf             16    71
il2, pf             14    37
lg, pf              13    32
st, pf              13    27
inf, bp             15    52
il2, bp             12    30
lg, bp              10    27
st, bp              10    23

Second panel: bar labels cinfpf, oinfpf, cstpf, ostpf, cinfbp, oinfbp, cstbp, ostbp; labeled values (%): 10, 11, 9, 27, 7, 23, 59, 54.

In the workload

- "Least-aggressive" configurations considered
  - 15 to 32% degradation due to class-order issue
    (more severe degradation expected for in-order policy)
  - 15 to 26% degradation due to imperfect branch predictor
  - 30 to 66% degradation due to finite L1 cache (128K)
  - 10 to 23% degradation due to processor width
- Diminishing benefits beyond dispatching eight operations per cycle
  - conventional instruction fetching mechanism
- Still many microarchitecture issues to investigate in detail

Observations

- Clear differences in behavior relative to other workloads
  - memory penalties in the workload shadow other effects
- Caveats due to use of traces
  - length
  - number of traces (just one in this presentation)
  - observability: no mispredicted paths
  - time scaling: no kernel code
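The point about diminishing benefits beyond a width of eight can be illustrated with the out-of-order, perfect-branch-prediction rows of the results table at the end of the deck. The sketch below assumes those table entries are cycles per instruction; the deck does not label them, but the values grow as caches shrink, which fits that reading. The function and variable names are hypothetical.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after` (0.25 means 25 % lower)."""
    return (before - after) / before


# Out-of-order issue, perfect branch prediction, from the results table.
# Keys are widths; values are assumed to be cycles per instruction.
cpi_by_width = {
    "Inf": {4: 0.53, 8: 0.31, 12: 0.27},
    "St": {4: 0.95, 8: 0.75, 12: 0.70},
}

steps = {}
for cache, cpi in cpi_by_width.items():
    step_4_to_8 = relative_reduction(cpi[4], cpi[8])
    step_8_to_12 = relative_reduction(cpi[8], cpi[12])
    steps[cache] = (step_4_to_8, step_8_to_12)
    print(f"{cache}: 4->8 {step_4_to_8:.1%}, 8->12 {step_8_to_12:.1%}")
```

In both cache configurations the step from width 8 to width 12 yields a much smaller reduction than the step from 4 to 8, which matches the observation on the slide.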

Summary

- Environment for early exploration
  - fast, flexible
  - trends among aggressive superscalar organizations
- Behavior of the workload very different from others (i.e., SPEC)
  - different microarchitecture tradeoffs
- Aggressive superscalar buildable?
  - need to quantify potential performance from realizable implementation
  - need to identify/develop features that provide better return

Results in the workload

Bp: 2-bit branch history table (8192 entries)
Pf: Perfect branch predictor

                            Bp                             Pf
Issue policy       Width    Inf     IL2     Lg      St     Inf     IL2     Lg      St
c: Class-order     4        2       7       1.18    9      0.72    0.93    3       1.12
                   8        0.71    0.96    7       1.18   0.62    1       0.91    0
                   12       0.70    0.95    6       1.17   0.60    0.79    9       0.97
o: Out-of-order    4        0.67    0.93    2       1.12   0.53    0.77    6       0.95
                   8        4       0.71    1       0.91   0.31    0.56    0.65    0.75
                   12       1       0.68    0.77    8      0.27    0.51    0.60    0.70
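The "adders as % of base case" reported on the earlier charts can be related to the table above. The sketch below assumes that the table entries are cycles per instruction and that an adder is the relative increase over the more aggressive alternative: out-of-order issue for the issue-policy adder, perfect prediction for the branch-prediction adder. Neither definition is spelled out in the deck, and the function name is hypothetical.

```python
def adder_percent(value: float, base: float) -> float:
    """Relative increase of `value` over `base`, in percent."""
    if base <= 0:
        raise ValueError("base must be positive")
    return 100.0 * (value - base) / base


# Issue-policy adder: class-order vs out-of-order, width 8, Inf caches, Pf.
issue_policy = adder_percent(0.62, 0.31)

# Branch-prediction adder: Bp vs Pf, out-of-order, width 4, Inf caches.
branch_prediction = adder_percent(0.67, 0.53)

print(f"issue policy: {issue_policy:.1f} %")
print(f"branch prediction: {branch_prediction:.1f} %")
```

The results, about 100 % and 26 %, are close to the 99 and 26 that appear among the chart labels earlier in the deck. The small difference is consistent with the table values being rounded to two decimals.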