Jim Keller. Digital Equipment Corp. Hudson MA

Similar documents
The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA

The Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights

The Alpha Microprocessor Architecture. Compaq Computer Corporation 334 South St., Shrewsbury, MA

Itanium 2 Processor Microarchitecture Overview

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor

Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor. David Johnson Systems Technology Division Hewlett-Packard Company

Design Objectives of the 0.35µm Alpha Microprocessor (A 500MHz Quad Issue RISC Microprocessor)

The Alpha and Microprocessors:

Digital Sets New Standard

Memory Hierarchies 2009 DAT105

Advanced cache optimizations. ECE 154B Dmitri Strukov

PowerPC 740 and 750

UltraSparc-3 Aims at MP Servers

Main Memory. EECC551 - Shaaban. Memory latency: Affects cache miss penalty. Measured by:

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola

Digital Leads the Pack with 21164

CS 152, Spring 2011 Section 8

Topics to be covered. EEC 581 Computer Architecture. Virtual Memory. Memory Hierarchy Design (II)

EECS 322 Computer Architecture Superpipline and the Cache

MIPS R5000 Microprocessor. Technical Backgrounder. 32 kb I-cache and 32 kb D-cache, each 2-way set associative

Exploring Alpha Power for Technical Computing

The Alpha 21264: A 500 MHz Out-of-Order Execution Microprocessor

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

BOBCAT: AMD S LOW-POWER X86 PROCESSOR

Advanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University

Niagara-2: A Highly Threaded Server-on-a-Chip. Greg Grohoski Distinguished Engineer Sun Microsystems

SA-1500: A 300 MHz RISC CPU with Attached Media Processor*

0;L$+LJK3HUIRUPDQFH ;3URFHVVRU:LWK,QWHJUDWHG'*UDSKLFV

Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

Microarchitecture Overview. Performance

Vector IRAM: A Microprocessor Architecture for Media Processing

Good luck and have fun!

XT Node Architecture

Next Generation Technology from Intel Intel Pentium 4 Processor

Superscalar Processors

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU

A Superscalar RISC Processor with 160 FPRs for Large Scale Scientific Processing

Advanced Processor Architecture

Computer System Components

Memory latency: Affects cache miss penalty. Measured by:

How to Write Fast Numerical Code

Memory latency: Affects cache miss penalty. Measured by:

OOO Execution and 21264

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II

Microarchitecture Overview. Performance

COSC 6385 Computer Architecture - Pipelining

OPENSPARC T1 OVERVIEW

Gemini: Sanjiv Kapil. A Power-efficient Chip Multi-Threaded (CMT) UltraSPARC Processor. Gemini Architect Sun Microsystems, Inc.

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.

Superscalar Machines. Characteristics of superscalar processors

Digital Semiconductor Alpha Microprocessor Product Brief

CS 152, Spring 2012 Section 8

EN164: Design of Computing Systems Lecture 24: Processor / ILP 5

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III

Mainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation

Mainstream Computer System Components

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

Lecture 19: Memory Hierarchy Five Ways to Reduce Miss Penalty (Second Level Cache) Admin

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III

620 Fills Out PowerPC Product Line

Issue Logic for a 600-MHz Out-of-Order Execution Microprocessor

A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications

Chapter-5 Memory Hierarchy Design

The ARM10 Family of Advanced Microprocessor Cores

1. PowerPC 970MP Overview

Dynamic Memory Dependence Predication

A 1.5GHz Third Generation Itanium Processor

Computer Architecture. Introduction. Lynn Choi Korea University

COSC 6385 Computer Architecture - Memory Hierarchy Design (III)

Lecture 16: Core Design. Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue

The Nios II Family of Configurable Soft-core Processors

POWER3: Next Generation 64-bit PowerPC Processor Design

This Material Was All Drawn From Intel Documents

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

COSC 6385 Computer Architecture - Memory Hierarchies (II)

E0-243: Computer Architecture

Keywords and Review Questions

Sam Naffziger. Gary Hammond. Next Generation Itanium Processor Overview. Lead Circuit Architect Microprocessor Technology Lab HP Corporation

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2

CS152 Computer Architecture and Engineering. Complex Pipelines

Intel released new technology call P6P

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Case Study IBM PowerPC 620

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Complex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

Portland State University ECE 587/687. Superscalar Issue Logic

How to write powerful parallel Applications

The Pentium II/III Processor Compiler on a Chip

Exploring the Effects of Hyperthreading on Scientific Applications

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)

UMBC. Rubini and Corbet, Linux Device Drivers, 2nd Edition, O Reilly. Systems Design and Programming

MIPS R4300I Microprocessor. Technical Backgrounder-Preliminary

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )

Transcription:

Jim Keller Digital Equipment Corp. Hudson MA

! Performance - SPECint95 100 50 21264 30 21164 10 1995 1996 1997 1998 1999 2000 2001 CMOS 5 0.5um CMOS 6 0.35um CMOS 7 0.25um

"## Continued Performance Leadership Est. SPECint95 of 30+ and SPECfp95 of 50+ Spectacular memory bandwidth Architectural, circuit and process leadership achieve 500 MHz operation in 0.35u CMOS Four-way Out-of-Order Execution Implements Alpha Motion Video Instructions MPEG2 encode in realtime

$ Feature Description Process Technology CMOS 6-6 layer metal Min. Feature Size 0.35 drawn Transistor Count 15.2 Million Frequency 500+ MHz Power 60 Watts Die Size 3.0 cm 2 Issue Rate 4 Integer 2 Floating Point L1 Instruction Cache 64KByte 2-way Set Associative L1 Data Cache 64KByte 2-way Set Associative Package 588 PGA Samples Q1 97 Volume H2 97

%&'#! FETCH MAP QUEUE REG EXEC DCACHE Stage: 0 1 2 3 4 5 6 Int Branch Int Issue Exec Predictors Queu File Addr Map e (80) Exec (20) Next-Line Address L1 Ins. Cache 64KB 2-Set 80 in-flight instructions plus 32 loads and 32 stores 4 Instructions / cycle FP Map FP Issue Queue (15) File (80) File (72) Exec Exec Addr FP ADD Div/Sqrt FP MUL L1 Data Cache 64KB 2-Set Victim Buffer Miss Address Bus Interface Unit Sys Bus 64-bit Cache Bus 128-bit Phys Addr 44-bit

(

%&'#! FETCH MAP QUEUE REG EXEC DCACHE Stage: 0 1 2 3 4 5 6 Int Branch Int Issue Exec Predictors Queu File Addr Map e (80) Exec (20) Next-Line Address L1 Ins. Cache 64KB 2-Set 80 in-flight instructions plus 32 loads and 32 stores 4 Instructions / cycle FP Map FP Issue Queue (15) File (80) File (72) Exec Exec Addr FP ADD Div/Sqrt FP MUL L1 Data Cache 64KB 2-Set Victim Buffer Miss Address Bus Interface Unit Sys Bus 64-bit Cache Bus 128-bit Phys Addr 44-bit

(# ( Four instructions plus a Next-fetch predictor are fetched per cycle from the instruction cache Pipelined branch predictor runs in parallel Next-fetch predictor addresses the cache next cycle Eliminates branch bubbles Learns dynamic jump targets for DLL s Set Predictor stores next-icache set (one of two) in current icache line Enables associative cache at high speed

(# (%&'#! PC... 0-cycle Penalty On Branches Set-Associative Behavior Tag S0 Tag S1 Instr. Cache Next- Fetch Pred Set Pred Cmp Cmp Hit/Miss/Set Miss Instructions (4) Next address + set

(# % Local Global Chooser Prediction Some branch directions can be predicted based on their past behavior: Local Correlation Others can be predicted based on how the program arrived at the branch: Global Correlation 21264 implements both methods and dynamically applies the best one

%&'#! FETCH MAP QUEUE REG EXEC DCACHE Stage: 0 1 2 3 4 5 6 Int Branch Int Issue Exec Predictors Queu File Addr Map e (80) Exec (20) Next-Line Address L1 Ins. Cache 64KB 2-Set 80 in-flight instructions plus 32 loads and 32 stores 4 Instructions / cycle FP Map FP Issue Queue (15) File (80) File (72) Exec Exec Addr FP ADD Div/Sqrt FP MUL L1 Data Cache 64KB 2-Set Victim Buffer Miss Address Bus Interface Unit Sys Bus 64-bit Cache Bus 128-bit Phys Addr 44-bit

)# Mapper: Rename 4 instructions per cycle (8 source / 4 dest ) 80 int + 72 fp physical registers Queue Stage: Instructions issued out-of-order when data ready Instructions leave the queues after they issue Integer: 20 entries / Quad-Issue Floating Point: 15 entries / Dual-Issue

)# MAPPER ISSUE QUEUE Saved Map State Arbiter Request Grant ister Numbers MAP CAMs Renamed isters ister Scoreboard Instructions 80 Entries 4 Inst / Cycle QUEUE 20 Entries Issued Instructions

%&'#! FETCH MAP QUEUE REG EXEC DCACHE Stage: 0 1 2 3 4 5 6 Int Branch Int Issue Exec Predictors Queu File Addr Map e (80) Exec (20) Next-Line Address L1 Ins. Cache 64KB 2-Set 80 in-flight instructions plus 32 loads and 32 stores 4 Instructions / cycle FP Map FP Issue Queue (15) File (80) File (72) Exec Exec Addr FP ADD Div/Sqrt FP MUL L1 Data Cache 64KB 2-Set Victim Buffer Miss Address Bus Interface Unit Sys Bus 64-bit Cache Bus 128-bit Phys Addr 44-bit

## Floating Point Execution Units Integer Execution Units FP Mul FP Add FP Div SQRT m. video shift/br add / logic add / logic / memory mul shift/br add / logic add / logic / memory

(## 21264 dual-issues floating-point instructions to two independent units Add / Div / Square Root unit 64-bit Add: 4-cycle latency, fully pipelined 64-bit IEEE Divide: 16 cycle latency 64-bit Square root: 33 cycle latency Multiply unit 4 cycle latency, fully pipelined

*# L1 Data Cache Two loads / stores per cycle (any combination) 1GHz operation: Phase pipelined - no bank conflict 16 Byte read / write per cycle Loads and stores issue out-of-order 32-entry load and store reorder buffers Stores check load buffer to enforce ordering Cache prefetch instructions Read, modify intent, non-cached, evict next

!+ Fill Data To Operand Bus Tag Store Addr DTB Store Data Dcache Set 0 Dcache Set 1 Victim Data Buffer Tag DTB To Operand Bus Fill Data Load Queue Store Queue Miss Addr File Victim Addr File

!(!++! L1 dcache: 8+ GB/s sustained bandwidth 128b datapath with 3 cycle load-to-use latency L2 cache: 4+ GB/s sustained bandwidth 128b datapath with 12 cycle load-to-use latency System interface: 2+ GB/s sustained bandwidth 64b datapath with 80 cycle load-to-use latency 16 64-byte off-chip memory references 8 read-misses and 8 write-backs

, ASIC Example Dupl TAG (opt) Control PCI Data Slice 2,4,8 System Port Address Out Address In System Data 64b Alpha 21264 L2 Cache Port TAG Address Data 128b L2 Cache TAG RAMs L2 Cache Data RAMs Frequency: 80-200 MHz Clock In I/O: 2V HSTL L2 Cache: 0-16MB Synch Direct Mapped Peak Bandwidth: Option 1: - 133 MHz BurstRAM 2.1GB/Sec Option 2: RISC 250MHz Late Write 4.0GB/Sec Option 3: Dual Data 333MHz 5.3GB/Sec

'+!! 21264 & L2 Cache 64b 8 Data Switches 256b 256b Memory Bank 64b 21264 & L2 Cache Control Chip 100MHz SDRAM Memory Bank PCI-chip PCI-chip 64b PCI 64b PCI 100MHz SDRAM

-+!! 21264 & L2 Cache 64b 2 or 4 Data Switches 128b 128b Memory Bank Control Chip Memory Bank PCI-chip 64b PCI

!$! $./ SPECint95 30 25 20 15 10 5 0 21264-500 est 21164-500 PA8000-180 P6-200 Ultra Sparc-200 R10000-200 SPECfp95 50 40 30 20 10 0 21264-500 est 21164-500 PA8000-180 P6-200 Ultra Sparc-200 R10000-200

!$! 0$1 McCalpin (MB/s) 1600 1200 800 400 0 21264-500 est 21164-500 PA8000-180 P6-200 Ultra Sparc-200 R10000-200

!!+ 21264 will maintain Alpha s performance lead 30+ SPECint95 and 50+ SPECfp95 Spectacular memory bandwidth Out-of-order execution can be done at very high frequency