J. H. Moreno, M. Moudgill, J.D. Wellman, P.Bose, L. Trevillyan IBM Thomas J. Watson Research Center Yorktown Heights, NY 10598


Trace-driven performance exploration of a PowerPC 601 workload on wide superscalar processors

Why yet another superscalar processor model and performance evaluation?

  Superscalar processors continue dominating the field
    No apparent likelihood of ending superscalar paradigm in near future
    Continuing improvements in features and capabilities
    Certain aspects getting easier due to number of transistors available
    Existing programs (binary compatibility)
  Need for evaluating new implementation challenges
    High frequency objectives
    New structures and algorithms
    Wider instruction issue
  Need to understand impact of various classes of workloads
    "Commercial" workloads...

Moreno et al., 01/26/98

The MET: Microarchitecture Exploration Toolset

  Collection of tools for exploration of microarchitecture features
    Trace-driven and execution-driven tools
    Fast simulation: >300 M inst/hour
  Intended to support early exploration of processor organizations
    detailed model of generalized pipeline
    trends among results instead of their magnitudes

Processor organization

  (block diagram; blocks shown)
  Instruction fetch: I-TLB, L1-I cache, NFA/Branch Predictor, I-Fetch, I-Buffer, I-Prefetch
  Decode/Expand
  Rename/Dispatch
  Issue queues: Integer, Load/store, Floating-point, Branch (each followed by issue logic and register read)
  Execution: Integer units, Load/store units, Floating-point units, Branch units
  Data access: L1-D cache, D-TLB, Load/store reorder buffer, Store queue, Miss queue, Cast-out queue
  Memory hierarchy: TLB2, L2 cache, Main Memory
  Retirement: Retirement queue, Retirement logic

Pipeline stages

  Integer:         Fetch  Decode  Expand  Rename  Dispatch  Issue  Read  Exec  WB  Retire
  Load:            Fetch  Decode  Expand  Rename  Dispatch  Issue  Read  EA  Dcache access  WB  Retire
  Floating point:  Fetch  Decode  Expand  Rename  Dispatch  Issue  Read  Exec1  Exec2  Exec3  WB  Retire

[...] and [...] traces

                                         [...]                     [...]
  Length                                 172 M instructions,       1212 M instructions,
                                         user and kernel space     user space
  Branch instructions                    18.9 %                    21.6 %
  Branches taken                         44.3 %                    56.3 %
  Instrs. in kernel space                22.1 %                    n/a
  Memory access instructions             34.8 %                    27.1 %
  Load/store multiple instructions       1.6 %                     1.6 %
  String instructions                    1.4 %                     0.3 %
  Load/store w/update instrs.            1.7 %                     2.8 %
  Average block size                     5.3 instrs.
  Mispredicted instructions, addresses   No                        Yes
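The trace characteristics above are simple ratios over instruction counts. The following is a minimal sketch of how such a summary could be derived, assuming a hypothetical trace of `(kind, taken)` records; this layout and the function name `trace_stats` are illustrative only and are not the MET trace format. It also assumes that "average block size" means instructions per branch.

```python
from collections import Counter


def trace_stats(trace):
    """Summarize a trace given as (kind, taken) records.

    kind is 'branch', 'mem', or 'other'; taken only matters for branches.
    This record layout is a hypothetical stand-in, not the MET trace format.
    """
    counts = Counter()
    taken = 0
    total = 0
    for kind, was_taken in trace:
        total += 1
        counts[kind] += 1
        if kind == "branch" and was_taken:
            taken += 1
    if total == 0:
        raise ValueError("empty trace")
    branches = counts["branch"]
    return {
        "branch_pct": 100.0 * branches / total,
        "taken_pct": 100.0 * taken / branches if branches else 0.0,
        "mem_pct": 100.0 * counts["mem"] / total,
        # Instructions per branch, used here as the average block size.
        "avg_block_size": total / branches if branches else float(total),
    }


# Toy 10-instruction trace: 2 branches (1 taken), 3 memory accesses.
toy_trace = [
    ("other", False), ("mem", False), ("other", False), ("mem", False),
    ("branch", True), ("other", False), ("mem", False), ("other", False),
    ("other", False), ("branch", False),
]
stats = trace_stats(toy_trace)
print(stats)
```

Under that definition, a trace with 18.9 % branch instructions has about 1 / 0.189, or roughly 5.3, instructions per block, which is in line with the average block size listed in the table.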

adder for various processor configurations

  (bar chart; series: Finite/non-perfect, Infinite/perfect; one panel labeled gcc95)
  Much larger adder in the case of [...] workload

Miss rates (per 1000 instructions): 64K L1, 2M L2, 128 entries TLB, 8K entries BHT

                                     [...]     [...]
  I1                                 21.3      8.3
  I2                                 3.1       2
  D1                                 22.4      9.8
  D2                                 1.8       2
  I TLB                              1.8       ~0
  D TLB                              4.5       1.5
  Conditional branch misprediction   5.3 %     7.0 %

Exploration space (in this presentation)

  Issue policy:                     Class-order; Out-of-order
  Width (Fetch/Dispatch/Retire):    4/4/6; 8/8/12; 12/12/16
  Cache size (L1-I, L1-D, L2):      64K, 64K, 2M; 128K, 128K, 4M; 128K, 128K, Inf; Inf, Inf, Inf
  Branch prediction:                8192 entry BHT, 4096 BTAC; Perfect

  Widths                  Units         Ports                 Queues              Physical registers
  Fetch/Dispatch/Retire   FX/FP/LS/BR   Data cache and TLB    Issue/Retire/IBuf   GPR/FPR/CCR/SPR
  4/4/6                   3/2/2/2       2                     20(12)/128/24       80/80/32/64
  8/8/12                  6/4/4/4       4                     40/160/48           128/128/64/96
  12/12/16                8/4/6/4       6                     60/160/72           128/128/64/96

Other parameters (examples)

  Sizes
    I-prefetch buffer (entries)               4
    Miss queue, cast-out queue (entries)      8
    Store queue, reorder buffer (entries)     31
    D/I-TLBs (entries)                        128
    TLB2 (entries)                            1024
    L1-I/D, L2-cache line size (bytes)        128
    Page size (bytes)                         4096
  Latencies
    I-prefetch buffer latency (cycles)        1
    D/I-TLBs miss penalty (cycles)            4
    TLB2 miss penalty (cycles)                40
    L1-I/D cache miss penalty (cycles)        8. 7
    L2 cache miss penalty (cycles)            40
  Branch prediction
    BTAC (entries)                            4096
    LR stack size (entries)                   32
    Branch history table (entries)            8192
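To get a feel for how the latencies above combine with the miss rates reported earlier, a first-order estimate multiplies misses per 1000 instructions by the miss penalty. This is a back-of-the-envelope sketch and not how the MET computes its results: it assumes every miss stalls the processor for the full penalty with no overlap, which an out-of-order model would partly hide. The function name `stall_cycles_per_instr` is hypothetical; the inputs are the first-column L2 and TLB miss rates from the miss-rate table together with the 40-cycle L2 and 4-cycle D/I-TLB miss penalties listed above.

```python
def stall_cycles_per_instr(misses_per_1000, penalty_cycles):
    """First-order stall estimate: misses/instruction times cycles/miss.

    Assumes no overlap between misses and other work (a pessimistic bound).
    """
    return misses_per_1000 * penalty_cycles / 1000.0


# L2 misses: I2 = 3.1 and D2 = 1.8 per 1000 instructions, 40-cycle penalty.
l2_stall = stall_cycles_per_instr(3.1 + 1.8, 40)

# First-level TLB misses: I TLB = 1.8 and D TLB = 4.5 per 1000, 4-cycle penalty.
tlb_stall = stall_cycles_per_instr(1.8 + 4.5, 4)

print(f"L2 stall estimate:  {l2_stall:.3f} cycles per instruction")
print(f"TLB stall estimate: {tlb_stall:.3f} cycles per instruction")
```

Even under this crude model the L2 term is several times larger than the TLB term, because the 40-cycle penalty outweighs the lower miss count.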

adders due to issue policy (as % of base case)

  (bar charts; series: Class-order, Out-of-order; bar labels tabulated below)

  First chart
    Config    w=4    w=8    w=12
    InfPf     35     99     125
    IL2Pf     21     45     53
    LgPf      21     40     47
    StPf      18     34     38
    InfBp     21     60     70
    IL2Bp     16     35     40
    LgBp      16     33     37
    StBp      15     29     32

  Second chart
    Config    w=4    w=8    w=12
    InfPf     74     152    209
    StPf      73     127    154
    InfBp     30     51     59
    StBp      30     48     53

adders due to branch prediction (as % of base case)

  (bar charts; series: Imperfect, Perfect; bar labels tabulated below)

  First chart
    Config    c4     c8     c12    o4     o8     o12
    inf       13     14     16     26     42     54
    il2       15     18     21     21     27     33
    lg        14     17     19     19     24     28
    st        15     18     20     18     22     26

  Second chart
    Config    c4     c8     c12    o4     o8     o12
    inf       21     22     25     63     103    143
    st        19     23     24     58     88     107

adders due to cache size (as % of base case)

  (bar charts; series in first chart: St, Lg, IL2, Inf; series in second chart: St, Inf)

  First chart, bar labels
    Config    c4     c8     c12    o4     o8     o12
    pf        29     31     31     44     79     92
    bp        31     35     36     38     60     66

  Second chart
    Configurations: c4pf c8pf c12pf o4pf o8pf o12pf c4bp c8bp c12bp o4bp o8bp o12bp

adders due to processor width (as % of base case)

  (bar charts; series: w=4, w=8, w=12)

  First chart, bar labels
    Config    c, pf    o, pf    c, bp    o, bp
    inf       16       71       15       52
    il2       14       37       12       30
    lg        13       32       10       27
    st        13       27       10       23

  Second chart
    Configurations: cinfpf oinfpf cstpf ostpf cinfbp oinfbp cstbp ostbp
    Bar labels, in extraction order: 10 11 9 27 7 23 59 54

In [...] workload

  "Least-aggressive" configurations considered
    15 to 32% degradation due to class-order issue
      more severe degradation expected for in-order policy
    15 to 26% degradation due to imperfect branch predictor
    30 to 66% degradation due to finite L1 cache (128K)
    10 to 23% degradation due to processor width
  Diminishing benefits beyond dispatching eight operations per cycle
    conventional instruction fetching mechanism
  Still many microarchitecture issues to investigate in detail

Observations

  Clear differences in behavior relative to [...]
    memory penalties in [...] shadow other effects
  Caveats due to use of traces
    length
    number of traces (just one in this presentation)
    observability
      in [...]: no mispredicted paths
    time scaling
      in [...]: no kernel code

Summary

  Environment for early exploration
    fast, flexible
    trends among aggressive superscalar organizations
  Behavior of [...] workload very different from others (i.e., SPEC)
    different microarchitecture tradeoffs
  Aggressive superscalar buildable?
    need to quantify potential performance from realizable implementation
    need to identify/develop features that provide better return

[...] results in [...] workload

  Bp: 2-bit branch history table (8192 entries)
  Pf: Perfect branch predictor

                             Bp                            Pf
  Issue policy      Width    Inf    IL2    Lg     St       Inf    IL2    Lg     St
  c: Class-order    4        2      7      1.18   9        0.72   0.93   3      1.12
                    8        0.71   0.96   7      1.18     0.62   1      0.91   0
                    12       0.70   0.95   6      1.17     0.60   0.79   9      0.97
  o: Out-of-order   4        0.67   0.93   2      1.12     0.53   0.77   6      0.95
                    8        4      0.71   1      0.91     0.31   0.56   0.65   0.75
                    12       1      0.68   0.77   8        0.27   0.51   0.60   0.70
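The "adder as % of base case" figures quoted throughout can be read as a relative difference between two configurations. The sketch below shows that arithmetic under two stated assumptions: that the results table holds cycles per instruction, and that the base case is the more aggressive configuration (out-of-order issue in this example). The function name `adder_pct` is hypothetical, and the inputs are the fully legible Pf/St entries of the results table, read on the assumption that its values line up with the column headers in order.

```python
def adder_pct(value, base):
    """Relative increase of `value` over `base`, in percent."""
    if base <= 0:
        raise ValueError("base must be positive")
    return 100.0 * (value - base) / base


# Perfect branch prediction (Pf), St caches; class-order vs out-of-order.
width4 = adder_pct(1.12, 0.95)   # class-order 1.12, out-of-order 0.95
width12 = adder_pct(0.97, 0.70)  # class-order 0.97, out-of-order 0.70

print(f"width 4:  {width4:.1f} % adder due to class-order issue")
print(f"width 12: {width12:.1f} % adder due to class-order issue")
```

Because the tabulated values are rounded to two decimals, percentages recomputed this way can differ by a point or so from the labels printed on the charts.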