The Alpha 21264 Microprocessor: Out-of-Order Execution at 600 MHz. R. E. Kessler, Compaq Computer Corporation, Shrewsbury, MA



Some Highlights

- Continued Alpha performance leadership
  - 600 MHz operation in 0.35 µm CMOS6, 6 metal layers, 2.2 V
  - 15 million transistors, 3.1 cm² die, 587-pin PGA
  - SPECint95 of 30+ and SPECfp95 of 50+
  - Out-of-order and speculative execution
  - 4-way integer issue
  - 2-way floating-point issue
  - Sophisticated tournament branch prediction
  - High-bandwidth memory system (1+ GB/sec)

Alpha 21264: Block Diagram

[Figure: pipeline stages 0-6 (Fetch, Map, Queue, Register read, Execute, Dcache). Fetch: branch predictors, next-line address logic, and a 64 KB 2-way set-associative L1 instruction cache delivering 4 instructions/cycle. Map: integer and FP register maps; up to 80 in-flight instructions plus 32 loads and 32 stores. Queue: 20-entry integer and 15-entry FP issue queues. Register files: two 80-entry integer files and a 72-entry FP file. Execute: integer pipes with address units, plus FP add/divide/square-root and FP multiply pipes. Memory: 64 KB 2-way set-associative L1 data cache with a victim buffer and miss-address logic; the bus interface unit drives a 64-bit system bus and a 128-bit cache bus, with 44-bit physical addresses.]


21264 Instruction Fetch Bandwidth Enablers

- The 64 KB two-way set-associative instruction cache supplies four instructions every cycle
- The next-fetch and set predictors provide the fast cache access time of a direct-mapped cache and eliminate bubbles in non-sequential control flow
- The instruction fetcher speculates through up to 20 branch predictions to supply a continuous stream of instructions
- The tournament branch predictor dynamically selects between local and global history to minimize mispredicts
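The set-predictor idea in the second bullet can be sketched in a few lines. This is a toy model, not the 21264's actual logic: the table size, the train-on-use policy, and the `access` interface are illustrative assumptions. The cache reads only the predicted way first; a wrong guess costs a slower re-lookup and retrains the predictor.

```python
# Toy sketch of a set (way) predictor for a 2-way cache: guess one way per
# index so the common case reads like a direct-mapped cache, and retrain
# the guess whenever the other way turns out to hold the line.

class SetPredictor:
    def __init__(self, lines=512):
        self.way_guess = [0] * lines      # predicted way per cache index

    def access(self, index, tags, tag):
        # Probe only the predicted way first (direct-mapped speed).
        way = self.way_guess[index]
        if tags[index][way] == tag:
            return way, True              # set prediction correct: fast hit
        # Fall back to the other way, and retrain on a match.
        other = 1 - way
        if tags[index][other] == tag:
            self.way_guess[index] = other
            return other, False           # hit, but slower (mispredicted set)
        return None, False                # cache miss
```

The payoff is the same as on the slide: set-associative hit rate with near direct-mapped access time, at the cost of an occasional slow hit while the predictor relearns.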

Instruction Stream Improvements

[Figure: instruction fetch path. The PC-generation logic (which learns dynamic jumps) indexes the Icache; each line stores the next-line address and a set prediction, giving set-associative behavior with a 0-cycle penalty on predicted branches. Tag comparators for both sets (S0, S1) verify the prediction and report hit/miss/set-miss as the four fetched instructions stream out toward decode and branch prediction.]

Fetch Stage: Branch Prediction

- Some branch directions can be predicted from their own past behavior: local correlation. Example: `if (a % 2 == 0)` produces the repeating history TNTN; `if (b == 0)` produces NNNN.
- Others can be predicted from how the program arrived at the branch: global correlation. Example: if `if (a == 0)` and `if (b == 0)` were both taken, then `if (a == b)` can be predicted taken.

Tournament Branch Prediction

[Figure: the program counter indexes a local history table (1024 entries x 10 bits), whose output indexes the local prediction counters (1024 x 3 bits). The path history indexes the global prediction counters (4096 x 2 bits) and the choice prediction counters (4096 x 2 bits); the choice counters select between the local and global predictions to produce the final prediction.]
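A behavioral sketch of the tournament scheme shown above, using the slide's table sizes (1024 x 10 local history, 1024 3-bit local counters, 4096 2-bit global and choice counters). The indexing, initial values, and update rules here are simplified assumptions rather than the exact hardware.

```python
# Toy tournament predictor: a per-branch local predictor, a path-history
# global predictor, and a chooser trained only when the two disagree.

class TournamentPredictor:
    def __init__(self):
        self.local_history = [0] * 1024   # 10 bits of per-branch history
        self.local_ctr = [3] * 1024       # 3-bit saturating counters
        self.global_ctr = [1] * 4096      # 2-bit counters, path-history indexed
        self.choice_ctr = [1] * 4096      # 2-bit chooser counters
        self.path_history = 0             # 12 bits of recent outcomes

    def predict(self, pc):
        lh = self.local_history[pc % 1024]
        local_pred = self.local_ctr[lh] >= 4          # MSB of 3-bit counter
        global_pred = self.global_ctr[self.path_history] >= 2
        use_global = self.choice_ctr[self.path_history] >= 2
        return global_pred if use_global else local_pred

    def update(self, pc, taken):
        lh = self.local_history[pc % 1024]
        local_pred = self.local_ctr[lh] >= 4
        global_pred = self.global_ctr[self.path_history] >= 2
        # Train the chooser only when the component predictors disagree.
        if local_pred != global_pred:
            c = self.choice_ctr[self.path_history]
            self.choice_ctr[self.path_history] = (
                min(3, c + 1) if global_pred == taken else max(0, c - 1))
        # Train both component predictors toward the actual outcome.
        lc = self.local_ctr[lh]
        self.local_ctr[lh] = min(7, lc + 1) if taken else max(0, lc - 1)
        gc = self.global_ctr[self.path_history]
        self.global_ctr[self.path_history] = (
            min(3, gc + 1) if taken else max(0, gc - 1))
        # Shift the new outcome into both histories.
        self.local_history[pc % 1024] = ((lh << 1) | taken) & 0x3FF
        self.path_history = ((self.path_history << 1) | taken) & 0xFFF
```

After a short warm-up, a strictly alternating branch (the TNTN pattern from the previous slide) is predicted perfectly by the local side, since its 10-bit history uniquely identifies the next outcome.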


Mapper and Queue Stages

- Mapper:
  - Renames 4 instructions per cycle (8 source / 4 destination registers)
  - 80 integer + 72 floating-point physical registers
- Queue stage:
  - Integer: 20 entries, quad-issue
  - Floating point: 15 entries, dual-issue
  - Instructions issue out of order when their data are ready
  - Prioritized from oldest to youngest each cycle
  - Instructions leave the queues after they issue
  - The queue collapses every cycle as needed
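The mapper's renaming step can be illustrated with a minimal model. The size follows the slide (80 integer physical registers); the FIFO free list and the `rename` interface are illustrative assumptions, and recovery state for mispredicts is omitted.

```python
# Minimal register-rename sketch: each destination gets a fresh physical
# register from the free list, and sources read the map as it stood
# before the instruction, which removes false (WAR/WAW) dependences.

class RenameMap:
    def __init__(self, arch_regs=32, phys_regs=80):
        self.map = list(range(arch_regs))              # R0..R31 -> P0..P31
        self.free = list(range(arch_regs, phys_regs))  # remaining physicals

    def rename(self, dest, srcs):
        phys_srcs = [self.map[s] for s in srcs]  # read sources under old map
        new_dest = self.free.pop(0)              # allocate a fresh physical reg
        self.map[dest] = new_dest                # later readers see the new one
        return new_dest, phys_srcs
```

A consumer renamed after a producer automatically picks up the producer's new physical register, which is exactly the dependence the issue queue then waits on.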

Map and Queue Stages

[Figure: in the MAPPER, architectural register numbers look up the map CAMs (with saved map state for mispredict recovery) and emerge as renamed physical registers, 4 instructions per cycle (80 map CAM entries). In the ISSUE QUEUE (20 entries), a register scoreboard tracks operand readiness; instructions send requests to an arbiter, which grants issue slots to the issued instructions.]


Register and Execute Stages

[Figure: the floating-point execution units contain an FP multiply pipe and an FP add pipe with divide and square root. The integer execution units form two clusters (Cluster 0 and Cluster 1); each cluster has an upper pipe (motion-video or multiply, shift/branch, add/logic) and a lower pipe (add/logic/memory).]

Integer Cross-Cluster Instruction Scheduling and Execution

- Instructions are statically pre-slotted to the upper or lower execution pipes
- The issue queue dynamically selects between the left and right clusters
- This achieves most of the performance of 4-way issue with the simplicity of 2-way issue

[Figure: the two integer clusters; Cluster 0's upper pipe has motion-video, shift/branch, and add/logic units, Cluster 1's has multiply, shift/branch, and add/logic, and each cluster's lower pipe has add/logic/memory units.]


21264 On-Chip Memory System Features

- Two loads/stores per cycle, in any combination
- 64 KB two-way set-associative L1 data cache (9.6 GB/sec)
  - Phase-pipelined at > 1 GHz (no bank conflicts!)
  - 3-cycle latency (load issue to consumer issue) with hit prediction
- Out-of-order and speculative execution minimizes effective memory latency
- 32 outstanding loads and 32 outstanding stores maximize memory system parallelism
- Speculative stores forward data to subsequent loads

Memory System (Continued)

[Figure: bus interface data path. The L2 port and system port each run at 1.5x clocking; 64-bit paths connect them to the Cluster 0 and Cluster 1 memory units, speculative store data has its own 64-bit path, and 128-bit paths link the address path and the double-pumped GHz data cache.]

- New references check their addresses against existing references
- Speculative store data bypasses to loads with matching addresses
- Multiple misses to the same cache block merge
- Hazard detection logic manages the out-of-order references
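The miss-merging bullet can be sketched as a small miss-address file. The entry count here matches the 8 outstanding block fills mentioned on the off-chip slide, but the names, block size, and interface are illustrative assumptions.

```python
# Toy miss-address file: outstanding misses are tracked per cache block,
# and a new miss to an already-pending block merges into the existing
# entry instead of issuing a second fill request.

class MissAddressFile:
    def __init__(self, block_size=64, entries=8):
        self.block_size = block_size
        self.entries = entries
        self.pending = {}                        # block address -> waiting refs

    def lookup(self, addr, ref_id):
        block = addr // self.block_size * self.block_size
        if block in self.pending:
            self.pending[block].append(ref_id)   # merge with in-flight fill
            return "merged"
        if len(self.pending) >= self.entries:
            return "stall"                       # no free entry: hold the ref
        self.pending[block] = [ref_id]           # allocate a new fill request
        return "new_fill"

    def fill_complete(self, block):
        return self.pending.pop(block, [])       # wake all merged references
```

Merging is what lets 32 outstanding loads share far fewer off-chip fills: several loads to the same 64-byte block cost one bus transaction.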

Low-Latency Speculative Issue of Integer Load Data Consumers (Predict Hit)

    LDQ  R0, 0(R0)
    ADDQ R0, R0, R1
    OP   R2, R2, R3

[Figure: Q/R/E/D pipeline timing showing the LDQ cache-lookup cycle, the ADDQ issuing 3 cycles later, the following op issuing behind it, and the squash/restart point on a miss.]

- When predicting a load hit:
  - The ADDQ issues (speculatively) after 3 cycles
  - Best performance if the load actually hits (matching the prediction)
  - The ADDQ issues before the load's hit/miss outcome is known
- If the LDQ misses when predicted to hit:
  - Squash two cycles (replay the ADDQ and its consumers)
  - Force a mini-replay (directly from the issue queue)

Low-Latency Speculative Issue of Integer Load Data Consumers (Predict Miss)

    LDQ  R0, 0(R0)
    OP1  R2, R2, R3
    OP2  R4, R4, R5
    ADDQ R0, R0, R1

[Figure: Q/R/E/D pipeline timing showing the earliest ADDQ issue 5 cycles after the LDQ.]

- When predicting a load miss:
  - The minimum load latency is 5 cycles (more on an actual miss)
  - There are no squashes
  - Best performance if the load actually misses (as predicted)
- The hit/miss predictor is the MSB of a 4-bit counter (hits increment by 1, misses decrement by 2)
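The hit/miss predictor in the last bullet is simple enough to model directly. The asymmetric update (hits +1, misses -2) means a predicted-hit streak survives occasional misses, while a short run of misses flips the prediction quickly; the initial counter value below is an assumption.

```python
# Sketch of the load hit/miss predictor described above: a single 4-bit
# saturating counter whose MSB selects the 3-cycle (predict-hit) or
# 5-cycle (predict-miss) issue schedule.

class HitMissPredictor:
    def __init__(self):
        self.counter = 15          # 4-bit counter, saturates at 0 and 15

    def predict_hit(self):
        return self.counter >= 8   # MSB set -> schedule consumers for a hit

    def update(self, hit):
        if hit:
            self.counter = min(15, self.counter + 1)   # hits: +1
        else:
            self.counter = max(0, self.counter - 2)    # misses: -2
```

Starting from a saturated counter, four consecutive misses (15 → 13 → 11 → 9 → 7) clear the MSB and switch the core to the conservative 5-cycle schedule; a single hit afterward (7 → 8) switches it back.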

Dynamic Hazard Avoidance (Before Marking)

Program order (assume R28 == R29):

    LDQ R0, 64(R28)
    STQ R0, 0(R28)     ; store followed by a load to
    LDQ R1, 0(R29)     ; the same memory address!

First execution order:

    LDQ R0, 64(R28)
    LDQ R1, 0(R29)     ; this (re-ordered) load got the wrong data value!
    STQ R0, 0(R28)

Dynamic Hazard Avoidance (After Marking/Training)

Program order (assume R28 == R29; the second load is marked this time):

    LDQ R0, 64(R28)
    STQ R0, 0(R28)     ; store followed by a load to
    LDQ R1, 0(R29)     ; the same memory address!

Subsequent execution order (after marking):

    LDQ R0, 64(R28)
    STQ R0, 0(R28)
    LDQ R1, 0(R29)     ; the marked (delayed) load gets the bypassed store data!
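The marking/training behavior above can be sketched as a table of load PCs that have caused memory-order violations. The table size and one-bit entries are illustrative assumptions, and the sketch ignores aliasing and the periodic clearing a real design would need.

```python
# Toy load-marking table: the first time a load issues ahead of an older
# store to the same address, the violation is detected, the load replays,
# and its PC is marked; marked loads then wait for prior stores.

class LoadWaitTable:
    def __init__(self, size=1024):
        self.marked = [False] * size

    def must_wait(self, load_pc):
        # Checked at issue: a marked load holds until older stores resolve.
        return self.marked[load_pc % len(self.marked)]

    def train(self, load_pc):
        # Called when a speculative load is found to have bypassed an
        # older store to the same address (a memory-order violation).
        self.marked[load_pc % len(self.marked)] = True
```

This matches the two slides: the first execution reorders the load and takes the violation once; every subsequent execution delays that one load and picks up the bypassed store data.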

New Memory Prefetches

- Software-directed prefetches:
  - Prefetch
  - Prefetch with modify intent
  - Prefetch, evict next
  - Evict block (eject data from the cache)
  - Write hint: allocate a block with no data read; useful for full-cache-block writes

21264 Off-Chip Memory System Features

- 8 outstanding block fills + 8 victims
- Split L2 cache and system busses (back-side cache)
- High-speed, high-bandwidth point-to-point channels with clock-forwarding technology and low pin counts
- L2 hit load latency (load issue to consumer issue) = 6 cycles + SRAM latency
- Max L2 cache bandwidth of 16 bytes per 1.5 cycles: 6.4 GB/sec with a 400 MHz transfer rate
- L2 miss load latency can be 160 ns (with 60 ns DRAM)
- Max system bandwidth of 8 bytes per 1.5 cycles: 3.2 GB/sec with a 400 MHz transfer rate
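As a quick sanity check of the bandwidth figures above (assuming the stated 600 MHz core clock, so one transfer per 1.5 cycles is a 400 MHz transfer rate):

```python
# Bandwidth arithmetic for the off-chip busses: bytes per transfer times
# the transfer rate, reported in (decimal) GB/sec.

core_mhz = 600
transfer_mhz = core_mhz / 1.5                 # 400 MHz transfer rate

l2_gb_s = 16 * transfer_mhz * 1e6 / 1e9       # 16 bytes per L2 transfer
sys_gb_s = 8 * transfer_mhz * 1e6 / 1e9       # 8 bytes per system transfer

print(l2_gb_s, sys_gb_s)                      # 6.4 and 3.2 GB/sec
```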

21264 Pin Bus

[Figure: the 21264 connects to the L2 cache port (tag bus plus 128-bit data bus to the L2 tag and data RAMs) and, via separate address-out, address-in, and 64-bit system data busses, to the system chip set (DRAM and I/O). The independent data busses allow simultaneous data transfers.]

- L2 cache: 0-16 MB, synchronous, direct-mapped
- Peak L2 bandwidth examples:
  - 133 MHz BurstRAM: 2.1 GB/sec
  - 200+ MHz late-write SRAM: 3.2+ GB/sec
  - 400+ MHz dual-data SRAM: 6.4+ GB/sec

Compaq Alpha / AMD Shared System Pin Bus (External Interface)

[Figure: the same system pin bus ("Slot A") connects either an Alpha 21264 or an AMD K7, plus its cache, to the system chip set (DRAM and I/O) via address-out, address-in, and 64-bit system data busses.]

- High-performance system pin bus
- Shared system chip-set designs
- This is a win-win!

Dual 21264 System Example

[Figure: two 21264s, each with its L2 cache, connect over 64-bit busses to a bank of 8 data switches; the switches connect over 256-bit paths to two 100 MHz SDRAM memory banks under a control chip, and two PCI chips provide 64-bit PCI busses.]

Measured Performance: SPEC95

SPECint95:

    21264-600            30
    21164-600            18.8
    PA8200-240           17.4
    Pentium II-400       16.5
    UltraSparc II-360    16.1
    R10000-250           14.7
    PowerPC 604e-332     14.4

SPECfp95:

    21264-600            50
    21164-600            29.2
    PA8200-240           28.5
    POWER2-160           26.6
    R10000-250           24.5
    UltraSparc II-360    23.5
    Pentium II-400       13.7
    PowerPC 604e-332     12.6

Measured Performance: STREAMS

Max STREAMS bandwidth (MB/sec):

    21264-600 (assumed)                 1000
    RS6000-397                           883
    UltraSparc II - UE6000 (assumed)     367
    R10000-250 - O2000                   358
    21164-533 - 164SX                    292
    Pentium II-300 - Dell                213

Alpha 21264

[Figure: die floorplan labeling the floating-point unit, floating-point mapper and queue, instruction cache, data cache, integer units (left and right), integer mapper and queue, memory controllers, bus interface unit, and the data and control busses.]

Summary

- The 21264 will maintain Alpha's performance lead:
  - 30+ SPECint95 and 50+ SPECfp95
  - 1+ GB/sec memory bandwidth
- The 21264 proves that high frequency and sophisticated architectural features can coexist:
  - High-bandwidth speculative instruction fetch
  - Out-of-order execution
  - 6-way instruction issue
  - Highly parallel out-of-order memory system