HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor

Similar documents
Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor. David Johnson Systems Technology Division Hewlett-Packard Company

The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA

Itanium 2 Processor Microarchitecture Overview

Four-Way Superscalar PA-RISC Processors

Jim Keller. Digital Equipment Corp. Hudson MA

The Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights

EECS 322 Computer Architecture Superpipline and the Cache

Microarchitecture Overview. Performance

Processor (IV) - advanced ILP. Hwansoo Han

Microarchitecture Overview. Performance

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

Digital Leads the Pack with 21164

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

Case Study IBM PowerPC 620

E0-243: Computer Architecture

The Processor: Instruction-Level Parallelism

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti

Keywords and Review Questions

The PA 7300LC Microprocessor: A Highly Integrated System on a Chip

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

EN164: Design of Computing Systems Lecture 24: Processor / ILP 5

Handout 2 ILP: Part B

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Hardware-Based Speculation

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Multiple Instruction Issue. Superscalars

PowerPC TM 970: First in a new family of 64-bit high performance PowerPC processors

Chapter 4 The Processor 1. Chapter 4D. The Processor

COMPUTER ORGANIZATION AND DESI

Static vs. Dynamic Scheduling

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

Portland State University ECE 587/687. Superscalar Issue Logic

Intel released new technology call P6P

The Alpha Microprocessor Architecture. Compaq Computer Corporation 334 South St., Shrewsbury, MA

IF1 --> IF2 ID1 ID2 EX1 EX2 ME1 ME2 WB. add $10, $2, $3 IF1 IF2 ID1 ID2 EX1 EX2 ME1 ME2 WB sub $4, $10, $6 IF1 IF2 ID1 ID2 --> EX1 EX2 ME1 ME2 WB

Design Objectives of the 0.35µm Alpha Microprocessor (A 500MHz Quad Issue RISC Microprocessor)

CS425 Computer Systems Architecture

A Microarchitectural Survey of Next Generation Microprocessors

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

Multiple Issue ILP Processors. Summary of discussions

620 Fills Out PowerPC Product Line

Superscalar Processor Design

Pipelining and Vector Processing

Advanced Processor Architecture

Super Scalar. Kalyan Basu March 21,

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

Limitations of Scalar Pipelines

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

5008: Computer Architecture

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

Evolution of ISAs. Instruction set architectures have changed over computer generations with changes in the

Chapter 4 The Processor (Part 4)

Pipelining to Superscalar

CS146 Computer Architecture. Fall Midterm Exam

1. PowerPC 970MP Overview

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

Multithreaded Processors. Department of Electrical Engineering Stanford University

UltraSparc-3 Aims at MP Servers

Complex Pipelines and Branch Prediction

Four Steps of Speculative Tomasulo cycle 0

CS252 Graduate Computer Architecture Lecture 6. Recall: Software Pipelining Example

Chapter 13 Reduced Instruction Set Computers

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013

CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming

The Pentium II/III Processor Compiler on a Chip

As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor.

CS 152 Computer Architecture and Engineering

ESE 545 Computer Architecture Instruction-Level Parallelism (ILP): Speculation, Reorder Buffer, Exceptions, Superscalar Processors, VLIW

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

CS 152 Computer Architecture and Engineering

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

Multi-cycle Instructions in the Pipeline (Floating Point)

BOBCAT: AMD S LOW-POWER X86 PROCESSOR

CS 152, Spring 2011 Section 8

ELE 375 Final Exam Fall, 2000 Prof. Martonosi

" # " $ % & ' ( ) * + $ " % '* + * ' "

Advanced Instruction-Level Parallelism

Instruction Set Principles and Examples. Appendix B

CS 61C: Great Ideas in Computer Architecture. Multiple Instruction Issue, Virtual Memory Introduction

EEC 581 Computer Architecture. Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW)

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

MIPS R5000 Microprocessor. Technical Backgrounder. 32 kb I-cache and 32 kb D-cache, each 2-way set associative

New Advances in Micro-Processors and computer architectures

EC 513 Computer Architecture

Hardware Speculation Support

ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism

Processing Unit CS206T

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

CS152 Computer Architecture and Engineering. Complex Pipelines

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University

MICROPROCESSOR. Merced Shows Innovative Design. Static, Dynamic Elements Work in Synergy With Compiler

Transcription:

The A High Performance Out-of-Order Processor Hot Chips VIII IEEE Computer Society Stanford University August 19, 1996 Hewlett-Packard Company Engineering Systems Lab - Fort Collins, CO - Cupertino, CA ashok@cup.hp.com Hot Chips VIII Presentation - Page 1

Presentation Overview Design Objectives Hardware Highlights Chip Statistics Performance IRB Design Hot Chips VIII Presentation - Page 2

Design Objectives Leadership Performance Full Support for 64-bit Applications New PA 2.0 Architecture Binary compatibility with existing code Glueless Support for up to 4-way MP Hot Chips VIII Presentation - Page 3

PA-RISC 2.0 Enhancements New 64 Bit Architecture Wider Registers New computational units Virtual addressing Physical addressing Fast TLB insert instructions Load/Store instructions with 16-bit displacement Memory prefetch instructions Variable sized pages Multimedia half-word instructions Branch with 22-bit displacement, short pointer Branch prediction hinting Floating point multiply-accumulate FP multiple compare result bits Carefully selected others Hot Chips VIII Presentation - Page 4

Application Performance In order to achieve sustained performance on large applications one needs: Large Primary Caches Methods to hide Memory Latency Dynamic Instruction Reordering High Bandwidth System Bus RUNWAY: 768 MB/sec Split Transaction Incorporates support for multiple outstanding memory requests Hot Chips VIII Presentation - Page 5

Hardware Highlights Completely redesigned core/new microarchitecture 56 Entry Instruction Reorder Buffer (IRB) Peak execution rate of 4 instructions/cycle 8 Computational Units FPMAC 3 cycle latency, fully pipelined DIV/SQRT 17 cycle latency, not pipelined all others single cycle latency 2 Load/Store Units 32 Entry Branch Target Address Cache (fully associative) Zero state taken branch penalty for branches that hit in BTAC Branch Prediction Hardware 256 Entry Branch History Table Static or Dynamic Prediction Hot Chips VIII Presentation - Page 6

Cache Design No on-chip cache Single Level off-chip Split Instruction/Data up to 4M/4M Direct Mapped Uses industry standard synchronous SRAMs Two state pipelined access Hot Chips VIII Presentation - Page 7

Functional Block Diagram Instr. Cache Instruction Fetch Unit 64-bit Integer ALUs Sort System Bus Interface Runway bus Shift/ Merge Units FMAC Units Instruction Reorder Buffer ALU Buffer 28 entries Memory Buffer 28 entries Ld/St Address Adders Address Reorder Buffer 28 entries Data Cache Divide/ SQRT Units Retire Rename Registers Rename Registers Architected Registers Hot Chips VIII Presentation - Page 8

Chip Statistics Fabricated in HP's 0.5 micron, 3.3V CMOS Process 0.28 um L eff 5 metal layers Die size: 17.68 mm x 19.1 mm Transistor Count: 3.8 million Flip-Chip Packaging Technology 704 signals, 1,200 Power/Ground bumps 1,085 pin package Ceramic Land Grid Array Hot Chips VIII Presentation - Page 9

Die Photo System Bus Interface Integer Register Stacks Integer Functional Units Inst Addr & Control Registers Data Cache Interface Instruction Reorder Buffer TLB Inst Cache Interface Inst Fetch Unit FP Functional Units Hot Chips VIII Presentation - Page 10

Performance At 180 MHz: 11.8 Spec Int 95 20.2 Spec FP 95 Currently in production Systems are shipping! Hot Chips VIII Presentation - Page 11

Performance Enablers Large number of functional units Aggressive Out-of-Order Execution Robust dependency tracking Large window of available instructions Explicit Hinting from Compiler Data Prefetch Branch Prediction High Performance Bus Interface Sustained superscalar operation Hot Chips VIII Presentation - Page 12

Effect of Instruction Reordering Efficiency (SPECint95 / MHz x 1000) 70 60 50 40 30 All examples are 4-way Superscalar designs except Pentium Pro which is 3-way Sun UltraSPARC (167 MHz) Digital 21164 (400 MHz) IBM PPC 620 (166 MHz) IBM PPC 604e (166 MHz) MIPS R10000 (200 MHz) Intel Pentium Pro (200 MHz) H-P PA-8000 (180 MHz) 20 0 10 20 30 40 50 60 Depth of Instruction Reorder Window Source: Microprocessor Report 4/15/96 Hot Chips VIII Presentation - Page 13

Instruction Reorder Buffer 56 entries, split into ALU/FP IRB and MEM IRB Reorders instructions on the fly Tracks all dependencies between instructions Tracks branch prediction status Capable of flash invalidating all instructions that were incorrectly fetched. Consists of 850K transistors and consumes 20% of Die Area Hot Chips VIII Presentation - Page 14

Block Diagram of IRB 4 inst SORT 4 inst 4 inst ALUs SMUs MACs DIVs ALU IRB MEM IRB LOAD STORE UNITS 2 inst 2 inst RETIRE Hot Chips VIII Presentation - Page 15

Instruction Insertion In Order Fetch any mix of four instructions/cycle Routed to appropriate portion of IRB Branches stored in both ALU and MEM IRB Instructions with two targets (such as LDWM) split into two parts Hot Chips VIII Presentation - Page 16

Instruction Launch Out of Order Oldest even and oldest odd instruction from each segment of IRB with all dependencies cleared is allowed to execute 4 instructions maximum Results stored in associated rename register for each entry Hot Chips VIII Presentation - Page 17

Instruction Retire In Order Up to two ALU/FP instructions and two MEM instructions each cycle Results moved from RRs to GRs/PSW Allows for precise exceptions Hot Chips VIII Presentation - Page 18

Dependency Tracking All possible instruction dependencies are identified at INSERT time. Operand Carry Borrow (CB) Shift Amount Register (SAR) Control (CTL) Nullify Address Handled by separate ARB unit that maintains state information about pending loads and stores. Many others... Hot Chips VIII Presentation - Page 19

Operand Dependencies Occur when source data of one instruction is the result of an earlier instruction. Most Recent Writer of Source data determined at insert time utilizing a two-pass mechanism. High Performance Broadcast mechanism. Upon launch, an IRB entry broadcasts its slot number to all other entries in the IRB. If a later instruction's source tag matches that driven on the launch bus, the dependency has cleared. Dependent instructions can launch very next cycle after a producer instruction executes. The IRB also sends information to the functional units about where its source data should come from (RRs, bypass, etc.) and where the results should be stored. Hot Chips VIII Presentation - Page 20

Carry Borrow Dependencies Occur when an instruction uses CB bits of the Processor Status Word. Most recent IRB entry passes information to incoming instructions regarding whether there is an instruction prior to it that sets CB bits. An instruction is aware it has a dependency, but does not know which instruction it is dependent on until its dependency has been cleared. Complex control Hot Chips VIII Presentation - Page 21

CB Dependencies (cont) Propagation System Tags travel up to two IRB entries/cycle Each IRB entry can: Block tag bus - if instruction writes CB bits and has not executed yet Drive Tag Bus - when an instruction that writes CB bits launches Pass tags from previous entry - if instruction does not write CB bits Trade-off Increased Latency for Area Savings In common case where an instruction that uses CB information immediately follows the setting instruction, there is no performance impact. Hot Chips VIII Presentation - Page 22

Conclusion The delivers high performance by: Aggressive Out-of-Order Execution Intelligent design choices Effective balancing of hardware to prevent bottlenecks Hot Chips VIII Presentation - Page 23

Acknowledgement The author would like to recognize the contributions of the entire processor design team from HP's Engineering Systems Lab in Fort Collins, Colorado. Hot Chips VIII Presentation - Page 24