The Alpha 21264 Microprocessor: Out-of-Order Execution at 600 MHz
R. E. Kessler
Compaq Computer Corporation, Shrewsbury, MA

Some Highlights
- Continued Alpha performance leadership
  - 600 MHz operation in 0.35um CMOS6, 6 metal layers, 2.2 V
  - 15 million transistors, 3.1 cm^2, 587-pin PGA
  - SPECint95 of 30+ and SPECfp95 of 50+
- Out-of-order and speculative execution
  - 4-way integer issue
  - 2-way floating-point issue
- Sophisticated tournament branch prediction
- High-bandwidth memory system (1+ GB/sec)
Alpha 21264: Block Diagram
[Block diagram: pipeline stages 0-6 (Fetch, Map, Queue, Reg, Exec, Dcache). Branch predictors and a next-line predictor drive the L1 instruction cache, which supplies 4 instructions per cycle; a 20-entry integer issue queue and a 15-entry floating-point issue queue feed the execution units (Add, Mul, Div/Sqrt) through the register files (80 integer, 72 floating-point); up to 80 in-flight instructions plus 32 loads and 32 stores; the L1 data cache and victim buffer connect through the bus interface to a 64-bit system bus and a 128-bit L2 cache bus, with 44-bit physical addresses.]
21264 Instruction Fetch Bandwidth Enablers
- The 64 KB two-way set-associative instruction cache supplies four instructions every cycle
- The next-fetch and set predictors provide the fast cache access time of a direct-mapped cache and eliminate bubbles in nonsequential control flow
- The instruction fetcher speculates through up to 20 branch predictions to supply a continuous stream of instructions
- The tournament branch predictor dynamically selects between local and global history to minimize mispredicts

Instruction Stream Improvements
[Diagram: the PC generation logic reads the next fetch address and predicted set alongside the instruction-cache data and tags; comparators on both tag sets (S0, S1) produce hit/miss/set-miss, giving set-associative behavior with a 0-cycle penalty on branches; the predictor also learns dynamic jumps; four instructions are delivered each cycle together with the next address + set.]
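The next-fetch/set predictor above can be sketched in miniature: every cache line carries a guess of the next fetch line and which of the two ways it lives in, trained on mispredicts. The table sizes and training policy here are illustrative assumptions, not the 21264's actual arrays.

```python
# Illustrative next-fetch/set predictor: each icache line stores a predicted
# next line and way, so a two-way cache can be read like a direct-mapped one.
LINES = 512  # illustrative size

next_line = [(i + 1) % LINES for i in range(LINES)]  # default: sequential fetch
next_set = [0] * LINES                               # predicted way (0 or 1)

def fetch(line, actual_next, actual_set):
    """Fetch one line; return True if the prediction carried no bubble."""
    correct = (next_line[line], next_set[line]) == (actual_next, actual_set)
    if not correct:                 # train on a mispredict
        next_line[line] = actual_next
        next_set[line] = actual_set
    return correct

# A taken branch from line 10 to line 200 (way 1) misses once, then hits.
print(fetch(10, 200, 1))   # False: first encounter mispredicts
print(fetch(10, 200, 1))   # True: trained
```

This captures why the scheme works: nonsequential control flow costs a bubble only the first time, and thereafter the predicted address and set are available without waiting for the tag compare.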
Fetch Stage: Branch Prediction
- Some branch directions can be predicted based on their own past behavior (local correlation):
    if (a % 2 == 0)   T N T N ...
    if (b == 0)       N N N N ...
- Others can be predicted based on how the program arrived at the branch (global correlation):
    if (a == 0)   (taken)
    if (b == 0)   (taken)
    if (a == b)   (predict taken: correlated with the earlier branches)

Tournament Branch Prediction
[Diagram: the program counter indexes a local history table (1024 x 10 bits); each 10-bit local history indexes local prediction counters (1024 x 3 bits). The global history indexes global prediction counters (4096 x 2 bits) and choice prediction counters (4096 x 2 bits); the choice predictor selects the final prediction from the local and global predictions.]
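The tournament structure above can be sketched as follows. Table sizes come from the slide (1024x10 local histories, 1024x3 local counters, 4096x2 global and choice counters); the indexing, initialization, and training details are simplified assumptions, not the 21264's exact logic.

```python
# Illustrative 21264-style tournament predictor (sizes from the slide,
# hashing and update policy are simplifying assumptions).
class TournamentPredictor:
    def __init__(self):
        self.local_history = [0] * 1024  # 10 bits of per-branch history
        self.local_ctr = [3] * 1024      # 3-bit counters, indexed by history
        self.global_ctr = [1] * 4096     # 2-bit counters, indexed by ghist
        self.choice_ctr = [1] * 4096     # 2-bit chooser: local vs. global
        self.ghist = 0                   # 12 bits of global history

    def predict(self, pc):
        lh = self.local_history[pc % 1024]
        local_taken = self.local_ctr[lh] >= 4
        global_taken = self.global_ctr[self.ghist] >= 2
        use_global = self.choice_ctr[self.ghist] >= 2
        return global_taken if use_global else local_taken

    def update(self, pc, taken):
        lh = self.local_history[pc % 1024]
        local_ok = (self.local_ctr[lh] >= 4) == taken
        global_ok = (self.global_ctr[self.ghist] >= 2) == taken
        # Train the chooser toward whichever component was right.
        if global_ok and not local_ok:
            self.choice_ctr[self.ghist] = min(3, self.choice_ctr[self.ghist] + 1)
        elif local_ok and not global_ok:
            self.choice_ctr[self.ghist] = max(0, self.choice_ctr[self.ghist] - 1)
        # Train both component counters, then shift both histories.
        d = 1 if taken else -1
        self.local_ctr[lh] = max(0, min(7, self.local_ctr[lh] + d))
        self.global_ctr[self.ghist] = max(0, min(3, self.global_ctr[self.ghist] + d))
        self.local_history[pc % 1024] = ((lh << 1) | taken) & 0x3FF
        self.ghist = ((self.ghist << 1) | taken) & 0xFFF

p = TournamentPredictor()
for i in range(100):                 # an alternating TNTN... branch
    p.update(0x40, i % 2 == 0)
print(p.predict(0x40))               # True: the pattern has been learned
```

An alternating branch defeats a plain 2-bit counter but is captured here because the 10-bit local history distinguishes the two phases of the pattern, which is exactly the local-correlation case from the slide.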
Map and Queue Stages
Map stage (register rename):
- Renames 4 instructions per cycle (8 source / 4 destination registers)
- 80 integer + 72 floating-point physical registers
Queue stage:
- Integer: 20 entries / quad-issue; floating-point: 15 entries / dual-issue
- Instructions issue out-of-order when their data is ready
- Prioritized from oldest to youngest each cycle
- Instructions leave the queues after they issue
- The queues collapse every cycle as needed
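The map stage above can be sketched as a rename table plus a free list. The 21264's actual CAM-based mapper differs; the physical-register count is from the slide, everything else here is illustrative.

```python
# Minimal register-rename sketch: architectural -> physical map plus a free
# list, in the spirit of the map stage (80 integer physical registers).
NUM_ARCH, NUM_PHYS = 32, 80

map_table = list(range(NUM_ARCH))            # arch reg -> phys reg
free_list = list(range(NUM_ARCH, NUM_PHYS))  # unallocated phys regs

def rename(srcs, dest):
    """Rename one instruction: look up sources, allocate a fresh dest."""
    phys_srcs = [map_table[s] for s in srcs]  # read current mappings
    old_dest = map_table[dest]                # freed when the instr retires
    map_table[dest] = free_list.pop(0)        # allocate a fresh phys reg
    return phys_srcs, map_table[dest], old_dest

# ADDQ R1, R2, R3 then SUBQ R3, R4, R3: the second instruction's R3 source
# maps to the first's new physical register, and its R3 destination gets yet
# another register, removing the WAW/WAR hazards.
srcs1, dest1, _ = rename([1, 2], 3)
srcs2, dest2, _ = rename([3, 4], 3)
print(srcs2[0] == dest1, dest2 != dest1)   # True True
```

Renaming is what lets 80 instructions be in flight at once: each in-flight writer owns a distinct physical register, so only true data dependences constrain the out-of-order queues.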
Map and Queue Stages (Datapath)
[Diagram: architectural register numbers enter the map CAMs (80 entries), which output renamed physical registers; saved map state allows recovery. Renamed instructions (4 per cycle) enter the 20-entry issue queue; a register scoreboard tracks operand readiness, and an arbiter grants issue requests to produce issued instructions.]
Register and Execute Stages
[Diagram: floating-point execution units: add, mul, and div/sqrt. Integer execution units: two clusters (Cluster 0 and Cluster 1), each with an upper pipe (add/logic plus shift/branch, with motion-video on one cluster and multiply on the other) and a lower pipe (add/logic/memory).]

Integer Cross-Cluster Instruction Scheduling and Execution
- Instructions are statically pre-slotted to the upper or lower execution pipes
- The issue queue dynamically selects between the left and right clusters
- This achieves most of the performance of 4-way issue with the simplicity of 2-way issue
21264 On-Chip Memory System Features
- Two loads/stores per cycle, in any combination
- 64 KB two-way set-associative L1 data cache (9.6 GB/sec)
  - Phase-pipelined at > 1 GHz (no bank conflicts!)
  - 3-cycle latency (load issue to issue of consumer) with hit prediction
- Out-of-order and speculative execution minimizes effective memory latency
- 32 outstanding loads and 32 outstanding stores maximize memory-system parallelism
- Speculative stores forward data to subsequent loads
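The last bullet, store-to-load forwarding, can be sketched as follows: a load checks its address against queued (not-yet-retired) stores and takes the youngest match's data instead of reading the cache. The 21264's 32-entry store queue does this with CAM hardware; this sketch is a simplification (whole-address matches only, no partial overlaps).

```python
# Illustrative store-to-load forwarding through a simple store queue.
store_queue = []            # (age, address, data), oldest first
cache = {0x100: 7}          # stand-in for the L1 data cache

def store(age, addr, data):
    store_queue.append((age, addr, data))

def load(addr):
    # The youngest matching store wins; otherwise read the cache.
    for age, a, d in reversed(store_queue):
        if a == addr:
            return d
    return cache.get(addr, 0)

store(1, 0x100, 42)
print(load(0x100))   # 42: forwarded from the store queue, not the cache
print(load(0x108))   # 0: no match, falls through to the cache
```

Forwarding lets a dependent load proceed before the store has retired and written the cache, which matters when 32 stores can be outstanding at once.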
Memory System (Continued)
[Diagram: the bus-interface data path (L2 and system ports at a 1.5x clock ratio) delivers 128-bit fill data and accepts 128-bit victim data; Cluster 0 and Cluster 1 each have 64-bit memory data busses; 128-bit speculative store data bypasses to loads; the data busses are double-pumped at GHz rates.]
- New references check addresses against existing references
- Speculative store data bypasses to loads whose addresses match
- Multiple misses to the same cache block merge
- Hazard detection logic manages out-of-order references

Low-latency Speculative Issue of Integer Load Data Consumers (Predict Hit)
    LDQ  R0, 0(R0)
    ADDQ R0, R0, R1
    OP   R2, R2, R3
[Pipeline diagram: each instruction passes through Q (queue), R (register read), E (execute), and D (dcache) stages; the ADDQ issues 3 cycles after the LDQ lookup cycle; on a miss the ADDQ and the following op are squashed and restarted.]
When predicting a load hit:
- The ADDQ issues (speculatively) after 3 cycles
- Best performance if the load actually hits (matching the prediction)
- The ADDQ issues before the load hit/miss calculation is known
If the LDQ misses when predicted to hit:
- Squash two cycles (replay the ADDQ and its consumers)
- Force a mini-replay (direct from the issue queue)
Low-latency Speculative Issue of Integer Load Data Consumers (Predict Miss)
    LDQ  R0, 0(R0)
    OP1  R2, R2, R3
    OP2  R4, R4, R5
    ADDQ R0, R0, R1
[Pipeline diagram: the ADDQ can issue no earlier than 5 cycles after the LDQ lookup cycle.]
When predicting a load miss:
- The minimum load latency is 5 cycles (more on an actual miss)
- There are no squashes
- Best performance if the load actually misses (as predicted)
- The hit/miss predictor is the MSB of a 4-bit counter (hits increment by 1, misses decrement by 2)

Dynamic Hazard Avoidance (Before Marking)
Program order (assume R28 == R29):
    LDQ R0, 64(R28)
    STQ R0, 0(R28)    <- store followed by a load to the same memory address!
    LDQ R1, 0(R29)
First execution order:
    LDQ R0, 64(R28)
    LDQ R1, 0(R29)    <- this (re-ordered) load got the wrong data value!
    STQ R0, 0(R28)
Dynamic Hazard Avoidance (After Marking/Training)
Program order (assume R28 == R29):
    LDQ R0, 64(R28)
    STQ R0, 0(R28)    <- store followed by a load to the same memory address!
    LDQ R1, 0(R29)    <- this load is marked this time
Subsequent executions (after marking):
    LDQ R0, 64(R28)
    STQ R0, 0(R28)
    LDQ R1, 0(R29)    <- the marked (delayed) load gets the bypassed store data!

New Memory Prefetches
- Software-directed prefetches:
  - Prefetch
  - Prefetch with modify intent
  - Prefetch, evict next
- Evict Block (eject data from the cache)
- Write hint: allocate a block with no data read; useful for full-cache-block writes
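The marking/training idea in the hazard-avoidance slides can be sketched as follows: the first time a load incorrectly issues ahead of an older store to the same address, its PC is marked; on later executions a marked load waits. The table name and training trigger here are illustrative, not the 21264's exact mechanism.

```python
# Illustrative store/load hazard marking: loads that once bypassed a
# conflicting older store are trained to wait on later executions.
marked_loads = set()       # PCs of loads trained to wait

def may_issue_early(load_pc, prior_store_addrs, load_addr):
    """Can this load issue ahead of the older, not-yet-executed stores?"""
    if load_pc in marked_loads:
        return False                       # trained: wait for the stores
    if load_addr in prior_store_addrs:     # would have read stale data
        marked_loads.add(load_pc)          # mark for next time (this one replays)
    return True

# First execution: the load issues early, gets the wrong value, is marked.
print(may_issue_early(0x2000, {0x80}, 0x80))   # True (and then replayed)
# Subsequent executions: the marked load waits and gets the store's data.
print(may_issue_early(0x2000, {0x80}, 0x80))   # False
```

This trades one replay for correct ordering ever after, while loads that never conflict keep the full benefit of issuing ahead of older stores.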
21264 Off-Chip Memory System Features
- 8 outstanding block fills + 8 victims
- Split L2 cache and system busses (back-side cache)
- High-speed, high-bandwidth point-to-point channels: clock-forwarding technology, low pin counts
- L2 hit load latency (load issue to consumer issue) = 6 cycles + SRAM latency
- Max L2 cache bandwidth of 16 bytes per 1.5 cycles: 6.4 GB/sec with a 400 MHz transfer rate
- L2 miss load latency can be 160 ns (60 ns DRAM)
- Max system bandwidth of 8 bytes per 1.5 cycles: 3.2 GB/sec with a 400 MHz transfer rate

21264 Pin Bus
[Diagram: the 21264 connects to the L2 tag and data RAMs over a dedicated 128-bit L2 port, and to the system chip set (DRAM and I/O) over a 64-bit system data bus (in/out); the independent data busses allow simultaneous data transfers.]
- L2: 0-16 MB, synchronous, direct-mapped
- Peak L2 bandwidth examples:
  - Example 1: 133 MHz burst RAM, 2.1 GB/sec
  - Example 2: 200+ MHz late-write RAM, 3.2+ GB/sec
  - Example 3: 400+ MHz dual-data RAM, 6.4+ GB/sec
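The bandwidth figures above follow directly from the transfer width and rate; a quick check of the arithmetic at the quoted 600 MHz core clock:

```python
# One transfer every 1.5 CPU cycles at 600 MHz is a 400 MHz transfer rate;
# 16 bytes/transfer on the L2 port and 8 bytes on the system bus give the
# slide's 6.4 and 3.2 GB/sec peaks.
cpu_mhz = 600
transfer_mhz = cpu_mhz / 1.5                     # 400.0
l2_gb_per_s = transfer_mhz * 1e6 * 16 / 1e9      # 6.4
sys_gb_per_s = transfer_mhz * 1e6 * 8 / 1e9      # 3.2
print(transfer_mhz, l2_gb_per_s, sys_gb_per_s)   # 400.0 6.4 3.2
```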
Compaq Alpha / AMD Shared System Pin Bus (External Interface)
[Diagram: the Alpha 21264 and the AMD K7 share the same system pin bus ("Slot A") to the system chip set (DRAM and I/O), with 64-bit system data in/out.]
- High-performance system pin bus
- Shared system chip-set designs
- This is a win-win!

Dual 21264 System Example
[Diagram: two 21264 processors, each with its own L2 cache, connect over 64-bit busses to 8 data-switch chips; the switches connect via 256-bit busses to two 100 MHz SDRAM memory banks; a control chip and two PCI chips (each with a 64-bit PCI bus) complete the system.]
Measured Performance: SPEC95

SPECint95:
    21264-600           30
    21164-600           18.8
    PA8200-240          17.4
    Pentium II-400      16.5
    UltraSparc II-360   16.1
    R10000-250          14.7
    PowerPC 604e-332    14.4

SPECfp95:
    21264-600           50
    21164-600           29.2
    PA8200-240          28.5
    POWER2-160          26.6
    R10000-250          24.5
    UltraSparc II-360   23.5
    Pentium II-400      13.7
    PowerPC 604e-332    12.6

Measured Performance: STREAMS
Max STREAMS bandwidth (MB/sec):
    21264-600                       1000
    RS6000-397                       883
    UltraSparc II - UE6000 (ass.)    367
    R10000-250 - O2000               358
    21164-533 - 164SX                292
    Pentium II-300 - Dell            213
Alpha 21264 Die Photo
[Die photo labels: floating-point units; floating-point mapper and queue; instruction fetch and data path; bus interface unit; memory controllers; integer mapper and queue; integer units (left and right); data and control busses; instruction and data caches.]

Summary
- The 21264 will maintain Alpha's performance lead
  - 30+ SPECint95 and 50+ SPECfp95
  - 1+ GB/sec memory bandwidth
- The 21264 proves that high frequency and sophisticated architectural features can coexist:
  - High-bandwidth speculative instruction fetch
  - Out-of-order execution
  - 6-way instruction issue
  - Highly parallel out-of-order memory system