Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor
David Johnson, Systems Technology Division, Hewlett-Packard Company

Presentation Overview
- PA-8500 Overview
- Instruction Fetch Capabilities
- Reorder Buffers ("The Queue")
- Data Cache
- System Bus

PA-8500
[Die floorplan: two 0.5 MB D-Cache arrays with D-tags, a 0.5 MB I-Cache with I-tag, integer and FP datapaths, IRB, ARB, TLB, instruction fetch, bus control, and the Runway Bus / I/O interface]

PA-8500 Processor Core
[Block diagram: Instruction Cache; Instruction Fetch Unit with BHT and BTAC; sort logic; dual 64-bit integer ALUs; dual shift/merge units; dual FP multiply/accumulate units; dual FP divide/SQRT units; 28-entry ALU buffer and 28-entry memory buffer; dual load/store adders; 28-entry reorder buffer; Data Cache; TLB; rename registers feeding retire logic and the architected registers; System Bus Interface to the Runway bus]

Memory Latency
[Chart: CPU vs. DRAM speed (MHz, log scale), 1990-2000 — the gap between processor and memory speed keeps widening]
- Latency problems: instruction fetches & loads
- Techniques for hiding latency: high hit-rate caches, prefetching, overlapping cache misses
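The cost of that gap is usually summarized as average memory access time (AMAT). A quick illustration with assumed numbers (not PA-8500 figures) shows why even a small miss rate hurts:

\[ \text{AMAT} = t_{\text{hit}} + \text{miss rate} \times \text{miss penalty} \]
\[ \text{AMAT} = 2 + 0.02 \times 100 = 4 \ \text{cycles} \]

With a 2-cycle hit, a 2% miss rate, and a 100-cycle miss penalty (all illustrative), the effective access time doubles. The three techniques above attack, respectively, the miss rate, the penalty itself, and the portion of the penalty that is actually exposed to the program.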

Instruction Fetch Features
Instruction Cache
- 0.5 MB on-chip cache
- 4-way set associative
- Pipelined 2-cycle access
- Provides 4 instructions per cycle to the CPU core
- Supports 32-byte and 64-byte line sizes
Instruction Prefetching

PA-8500 I-Cache Composition
4 instructions per cycle to the Queue from a 0.5 MB cache
[Diagram: four I-Cache RAM banks plus the tag array feed an I-Fetch mux with tag compare, which delivers instructions to the Instruction Reorder Buffer ("The Queue")]

PA-8500 Instruction Prefetching
1. I-Miss from cache
2. I-Miss issued to Runway Bus
3. I-Prefetch issued to Runway Bus
4. I-Miss return inserted into cache
5. I-Prefetch return held in Prefetch Buffer
6. I-Miss from cache causes Prefetch Buffer hit
7. I-Miss moved from Prefetch Buffer to cache
8. I-Prefetch issued to Runway Bus (next line)
[Diagram: Instruction Fetch Unit, Instruction Reorder Buffer ("The Queue"), Instruction Cache and Tags, Prefetch Buffer, and System Bus Interface to the Runway Bus]
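A minimal C sketch of this next-line prefetch flow, under stated assumptions: the single-entry prefetch buffer, the helper names (issue_bus_read, icache_fill), and the immediate bus returns are illustrative, not the actual hardware interfaces.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64u   /* assumed; the slide allows 32- or 64-byte lines */

/* Illustrative single-entry prefetch buffer; the real buffer organization
 * is not described on the slide. */
static struct { bool valid; uint64_t line; } pf_buf;

/* Stub: read request placed on the Runway bus. */
static void issue_bus_read(uint64_t line, bool is_prefetch)
{
    printf("%-10s issued for line 0x%llx\n",
           is_prefetch ? "I-Prefetch" : "I-Miss", (unsigned long long)line);
}

/* Stub: returned line written into the I-cache. */
static void icache_fill(uint64_t line)
{
    printf("line 0x%llx inserted into I-cache\n", (unsigned long long)line);
}

/* Step 1: an instruction fetch misses the I-cache.  Bus returns are
 * modeled as arriving immediately to keep the sketch synchronous. */
static void handle_icache_miss(uint64_t pc)
{
    uint64_t line = pc & ~(uint64_t)(LINE_SIZE - 1);

    if (pf_buf.valid && pf_buf.line == line) {
        /* Steps 6-7: the miss hits the prefetch buffer and the line is
         * moved into the cache without going back to the bus. */
        icache_fill(line);
    } else {
        /* Step 2: demand miss issued to the Runway bus.
         * Step 4: its return is inserted into the cache. */
        issue_bus_read(line, false);
        icache_fill(line);
    }
    /* Steps 3 and 8: prefetch the next sequential line; its return is
     * held in the prefetch buffer (step 5) until a later miss wants it. */
    issue_bus_read(line + LINE_SIZE, true);
    pf_buf.valid = true;
    pf_buf.line = line + LINE_SIZE;
}

int main(void)
{
    handle_icache_miss(0x1000);  /* cold miss: demand read + next-line prefetch */
    handle_icache_miss(0x1040);  /* sequential miss: satisfied from prefetch buffer */
    return 0;
}
```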

Reorder Buffers
[Diagram: instructions from I-Fetch enter the 56-entry Instruction Reorder Buffer ("The Queue"); dual load/store adders feed the 28-entry Reorder Buffer and the 1 MB Data Cache; the System Bus Interface connects to the Runway Bus]
Cycle-by-cycle progression of a load instruction: Insert, Launch, Cache, Cache, RR, Retire

Overlapping LOAD-MISSes
The Problem: [timeline — each LOAD-MISS must complete before its Use, and the next LOAD-MISS does not begin until then, so the miss latencies add up serially]
PA-8500 Solution: [timeline — several LOAD-MISSes are launched and overlapped, with their Uses following as the data returns]
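Whether the misses can overlap comes down to whether consecutive loads depend on each other. A minimal C sketch of the two situations (the data structures and loops are assumptions for illustration, not code from the talk):

```c
#include <stddef.h>

/* Dependent chain: each load's address comes from the previous load,
 * so the misses serialize (the "Problem" timeline). */
struct node { struct node *next; long value; };

long sum_list(const struct node *p)
{
    long sum = 0;
    while (p) {
        sum += p->value;   /* must wait for each miss before the next */
        p = p->next;
    }
    return sum;
}

/* Independent loads: all addresses are computable up front, so an
 * out-of-order core can have several misses outstanding at once
 * (the "Solution" timeline). */
long sum_array(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

In sum_list the Queue cannot launch the next load until the current miss returns its pointer; in sum_array the 56-entry Queue can keep launching loads while earlier misses are still outstanding.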

Reorder Buffer: High-Speed Custom Circuitry
[Diagram: 28-entry buffer with cache-port arbitration circuits (request, grant, and miss-grant signals on each port), insert and launch address-match logic, and miss paths from the ALU to the cache and out to the Runway bus]

Data Prefetching
The Problem: avoid the LOAD-MISS latency
Solution:
- Compiler inserts a prefetch instruction (LOAD to GR0)
- Independent instructions execute while the prefetch is outstanding
- Data is resident in the cache when the real LOAD executes (LOAD-HIT)
[Timeline: the LOAD-to-GR0 issues early, independent work overlaps the memory access, and the later LOAD hits in the cache]
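A compiler-level view of the same idea, as a hedged C sketch: __builtin_prefetch (a GCC/Clang builtin) stands in for the PA-RISC load-to-GR0 prefetch, and the prefetch distance of 8 iterations is an assumed tuning value, not a PA-8500 recommendation.

```c
#include <stddef.h>

#define DIST 8   /* assumed prefetch distance, in loop iterations */

long sum_with_prefetch(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST]);  /* non-faulting hint: start the
                                                  miss well before the data is
                                                  actually needed */
        sum += a[i];   /* by the time this load runs, the line should be
                          resident, turning a LOAD-MISS into a LOAD-HIT */
    }
    return sum;
}
```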

Data Cache Features
- 1.0 MB on-chip cache
- 4-way set associative
- 2-cycle pipelined access
- Two accesses per cycle
- Supports 32-byte and 64-byte line sizes
- Sophisticated store queue
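A quick worked example of the resulting geometry (assuming the 64-byte line-size option is selected; this derivation is not on the slide):

\[ \frac{1\,\mathrm{MB}}{4\ \mathrm{ways} \times 64\,\mathrm{B/line}} = \frac{2^{20}}{2^{8}} = 4096\ \mathrm{sets} \]

so each access uses 6 offset bits and 12 index bits, and the even/odd banking shown on the next slide is what supports the two accesses per cycle.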

Data Cache
[Diagram: even and odd data RAM banks with dual tag RAMs and compare logic, store queues, the TLB, the integer datapath, and the system interface — the even/odd banking provides the two accesses per cycle]

Single-Level vs. Multi-Level Cache Designs
[Diagram: single-level design (PA-8500) — core with a 1 MB data cache and a 0.5 MB instruction cache on chip, 1.5 MB total at 2-cycle access, connected directly to the system bus. Multi-level design — core with 32 KB L1 instruction and data caches at 1-cycle access, backed by a 4 MB off-chip L2 at ~15 cycles, then the system bus]
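One way to reason about the trade-off is to compare average memory access time under assumed miss rates. Every rate in this sketch is a made-up illustration (the slide gives only sizes and access times), so it shows the method rather than a measured result:

```c
#include <stdio.h>

int main(void)
{
    double mem_penalty = 100.0;   /* assumed main-memory latency, in cycles */

    /* Single-level design (PA-8500 style): 1.5 MB on chip at 2 cycles,
     * with an assumed 1% miss rate for the large cache. */
    double amat_single = 2.0 + 0.01 * mem_penalty;

    /* Multi-level design: 32 KB L1 at 1 cycle backed by a 4 MB off-chip
     * L2 at ~15 cycles; assumed 8% L1 miss rate, 15% local L2 miss rate. */
    double amat_multi = 1.0 + 0.08 * (15.0 + 0.15 * mem_penalty);

    printf("single-level AMAT: %.2f cycles\n", amat_single);  /* 3.00 */
    printf("multi-level  AMAT: %.2f cycles\n", amat_multi);   /* 3.40 */
    return 0;
}
```

Which design wins depends entirely on those rates; the single-level approach pays off when the large on-chip cache keeps the miss rate low enough to offset its longer 2-cycle hit time.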

System Bus Interface
- Split-transaction bus with out-of-order returns
- Multiple transactions in flight simultaneously
- Priority given to latency-sensitive transactions
- Asynchronous interface
- Turbo Mode
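A minimal C sketch of the split-transaction idea (the tag width, table size, and function names are assumptions for illustration): each request carries a transaction tag and the requester matches returns by tag, so responses can arrive in any order while other requests remain in flight.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_OUTSTANDING 8   /* assumed limit on in-flight transactions */

/* One outstanding read, matched by its tag when the data comes back. */
struct txn { bool valid; uint8_t tag; uint64_t addr; };

static struct txn outstanding[MAX_OUTSTANDING];

/* Address phase: issue the read and release the bus so other
 * transactions can start while this one's data is still pending. */
static void issue_read(uint8_t tag, uint64_t addr)
{
    outstanding[tag % MAX_OUTSTANDING] = (struct txn){ true, tag, addr };
    printf("issued   tag %u addr 0x%llx\n", (unsigned)tag,
           (unsigned long long)addr);
}

/* Data phase: match the return by tag, in whatever order memory responds. */
static void data_return(uint8_t tag)
{
    struct txn *t = &outstanding[tag % MAX_OUTSTANDING];
    if (t->valid && t->tag == tag) {
        printf("returned tag %u addr 0x%llx\n", (unsigned)tag,
               (unsigned long long)t->addr);
        t->valid = false;
    }
}

int main(void)
{
    issue_read(0, 0x1000);   /* three reads in flight simultaneously */
    issue_read(1, 0x2000);
    issue_read(2, 0x3000);
    data_return(2);          /* returns arrive out of order */
    data_return(0);
    data_return(1);
    return 0;
}
```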

Turbo Mode
[Waveform: clock and data signals — high-speed data transfer between memory and CPU]

Mitigating Memory Latency Effects
- Large caches
- Out-of-order Queue
- Flexible system interface
- Custom circuit design
The PA-8500 achieves superb performance!