A 1.5GHz Third Generation Itanium Processor
Jason Stinson, Stefan Rusu
Intel Corporation, Santa Clara, CA

Outline
- Processor highlights
- Process technology details
- Itanium processor evolution
- Block diagram
- Cache circuit design details
- Package details
- Front-side bus interface
- Clock distribution
- Power dissipation
- RAS, DFT and DFM features
- Frequency shmoo
- Summary

130nm Itanium 2 Processor Highlights
- 410M transistors
- 374mm² die size
- 6MB on-die L3 cache
- 1.5GHz at 1.3V
- 6.4GB/s, 400MT/s, 4-way bus interface
- Plug-in compatible with existing platforms
- Extensive RAS, DFT and DFM features
- Largest microprocessor transistor count and on-die cache

130nm Process Characteristics

Attribute      Value
Lgate          60nm
M1 pitch       350nm
M2 pitch       448nm
M3 pitch       448nm
M4 pitch       756nm
M5 pitch       1120nm
M6 pitch       1204nm
Dielectric     FSG, K=3.6
Memory cell    2.45µm²

S. Thompson et al., IEDM 2001 (top)
S. Tyagi et al., IEDM 2000 (bottom)

Itanium Processor Evolution

Attribute         Itanium Processor   Itanium 2 Processor   This work
Architecture      Explicitly Parallel Instruction Computing (all three)
Process           180nm               180nm                 130nm
Device Count      25M                 221M                  410M
On-die L3 cache   0*                  3MB                   6MB
Frequency         800MHz              1.0GHz                1.5GHz
Supply Voltage    1.6V                1.5V                  1.3V
Power             130W*               130W                  130W

* Includes 4MB cache on cartridge

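The generation-to-generation claims used elsewhere in the deck (2X cache, 50% frequency) can be checked against the table with a few divisions. This is a minimal sketch; the dictionary layout and the `ratio` helper are illustrative names, and only the numbers come from the table.

```python
# Values transcribed from the evolution table (Itanium 2 Processor vs. this work).
PREVIOUS = {"freq_ghz": 1.0, "l3_mb": 3, "devices_m": 221, "vdd": 1.5, "power_w": 130}
THIS_WORK = {"freq_ghz": 1.5, "l3_mb": 6, "devices_m": 410, "vdd": 1.3, "power_w": 130}


def ratio(attr: str) -> float:
    """Return this-work / previous-generation for one table attribute."""
    return THIS_WORK[attr] / PREVIOUS[attr]


if __name__ == "__main__":
    for attr in PREVIOUS:
        print(f"{attr:10s} x{ratio(attr):.2f}")
```

Frequency scales by 1.5 and L3 capacity by 2.0 while the power figure is unchanged, which matches the "same 130W envelope" statement on the Power slide.
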
Block Diagram

[Figure: processor block diagram. Labeled blocks:]
- L3 Cache
- L2 Cache, Quad Port
- L1 Instruction Cache and Fetch/Pre-fetch Engine, ITLB
- Branch Prediction
- Instruction Queue (8 bundles)
- 11 Issue Ports: B B B M M M M I I F F
- Register Stack Engine / Re-Mapping
- Scoreboard, Predicate, NATs, Exceptions
- IA-32 Decode and Control
- Branch & Predicate Registers, Branch Units
- 128 Integer Registers, Integer and MM Units
- 128 FP Registers, Floating Point Units
- Quad-Port L1 Data Cache and DTLB, ALAT
- Bus (128b data, 6.4GB/s @400MT/s)
- ECC (marked at several points in the diagram)

Cache Summary

Attribute          L1I           L1D                     L2                      L3
Size               16K           16K                     256K                    6M
Line Size          64B           64B                     128B                    128B
Ways               4             4                       8                       24
Replacement        LRU           NRU                     NRU                     NRU
Latency (clocks)   I-Fetch: 1    INT: 1, FP: NA          INT: 5, FP: 6           14
Write Policy       -             WT (RA)                 WB (WA)                 WB (WA)
Bandwidth          R: 48GB/s     R: 24GB/s, W: 24GB/s    R: 48GB/s, W: 48GB/s    R: 48GB/s, W: 48GB/s

- Cache bandwidth increased by 50%
- L3 cache latency increased to 14 clocks and set associativity doubled

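The table rows determine each cache's set count (size divided by line size times ways) and, at the 1.5GHz core clock, the bytes moved per clock. The sketch below works this out under two assumptions that the slide does not state: K and M are binary units, and GB/s means 10^9 bytes per second. `CACHES`, `num_sets` and `bytes_per_clock` are illustrative names.

```python
# Geometry transcribed from the Cache Summary table.
# Assumption: K = 1024 bytes, M = 1024 * 1024 bytes.
CACHES = {
    "L1I": {"size": 16 * 1024, "line": 64, "ways": 4},
    "L1D": {"size": 16 * 1024, "line": 64, "ways": 4},
    "L2": {"size": 256 * 1024, "line": 128, "ways": 8},
    "L3": {"size": 6 * 1024 * 1024, "line": 128, "ways": 24},
}

CORE_HZ = 1.5e9  # core clock from the highlights slide


def num_sets(cache: dict) -> int:
    """Sets = capacity / (line size * associativity)."""
    return cache["size"] // (cache["line"] * cache["ways"])


def bytes_per_clock(gb_per_s: float) -> float:
    """Convert a bandwidth in GB/s (assumed 1e9 bytes/s) to bytes per core clock."""
    return gb_per_s * 1e9 / CORE_HZ


if __name__ == "__main__":
    for name, cache in CACHES.items():
        print(name, num_sets(cache), "sets")
    print("48GB/s ->", bytes_per_clock(48), "bytes/clock")
    print("24GB/s ->", bytes_per_clock(24), "bytes/clock")
```

Under these assumptions the set counts come out to 64, 64, 256 and 2048, and the 48GB/s and 24GB/s figures correspond to 32 and 16 bytes per clock.
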
L1 Instruction Cache Circuit Detail

[Figure: circuit schematic and timing diagram]
- Schematic signals: p0wl, p1wl, p0_local_bl, p1_local_bl, p0_global_bl, p1_global_bl, p0_write_data, p1_write_data, write_en#, fill_cycle_1
- Timing traces: coreclock, fill_cycle_1, write_en#, p1_local_bl (write-1, read-2, read-3), p0_local_bl (read-1, write-1)

L3 Cache

[Figure: die floorplan (Core, L3 Cache) and L3 subarray detail]
- Subarray labels: Way[11:0] and Way[23:12], each with BLOCK0 to BLOCK7; decoders, sense amplifiers, timers, I/O mux
- Dimension labels: 1588µm, 776µm
- L3 cache contains 140 subarrays tiled to fit irregular shape of core

Package Details

[Figure: package photograph. Labels:]
- Power delivery connector
- Server Management Components
- Flip-Chip BGA package with Integrated Heat Spreader
- Interposer Substrate

Package Decoupling

[Figure: package decoupling. Labels:]
- Vdd (core)
- Vtt (FSB)
- PLL filter
- Power delivery connector

Front Side Bus

[Figure: bus topology with P1, P2, P3, P4 and CS]

Interface                Support
System                   Glueless 4-way Multi-Processor
Topology                 Dual-sided board, staggered vias
Termination Voltage      1.2V, common ground with core
Voltage Reference        Ground-referenced, 0.75V Vref
Data Bus Width           128-bit
Data Bus Speed           400MT/s source synchronous
Data Strobes             1 differential strobe for 16b of data
Peak BW                  6.4GB/s
Address, Control Speed   200MHz common clock

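The peak bandwidth row follows directly from the bus width and transfer rate in the same table. A minimal sketch of that arithmetic is below; the constant names are illustrative, and GB/s is taken as 10^9 bytes per second.

```python
# Figures transcribed from the Front Side Bus table.
DATA_BUS_BITS = 128
TRANSFERS_PER_S = 400_000_000      # 400MT/s, source synchronous
COMMON_CLOCK_HZ = 200_000_000      # address/control common clock
DATA_BITS_PER_STROBE = 16          # one differential strobe per 16 data bits

# 16 bytes per transfer * 400M transfers per second
peak_bytes_per_s = (DATA_BUS_BITS // 8) * TRANSFERS_PER_S

# Data transfers per common-clock period
transfers_per_clock = TRANSFERS_PER_S // COMMON_CLOCK_HZ

# Number of differential strobes needed to cover the data bus
strobe_count = DATA_BUS_BITS // DATA_BITS_PER_STROBE

if __name__ == "__main__":
    print(peak_bytes_per_s / 1e9, "GB/s peak")
    print(transfers_per_clock, "data transfers per common clock")
    print(strobe_count, "differential strobes")
```

The result is 6.4GB/s, with two data transfers per 200MHz clock and eight differential strobes across the 128-bit data bus.
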
Front-Side Bus Topology

[Figure: I/O placement on the die, previous implementation vs. this work]
- Previous implementation: four linear stripes
- This work: U-shape
- Labels: Core, L3 Cache, Data I/O, Address I/O, Control I/O

Itanium Processor Clock Distribution Trends

Attribute            Itanium Processor   Itanium 2 Processor   This work
Process/Metal        180nm / Al          180nm / Al            130nm / Cu
Primary tree         Single ended        Differential          Differential
Local distribution   Grid                Tree                  Tree
Clock skew           28ps                62ps                  24ps
Deskew Method        Active              On-demand             Fuse-based

Fuse-Based Clock Deskew

[Figure: deskew block diagram. Labels:]
- ITC (TAP)
- SLCB scan chain
- Ck Gen Unit, PD
- SLCBO zones
- FUSE Unit, FUSE settings (3-bit/SLCB)

SLCB = Second Level Clock Buffer

Clock Zone Skew Plot

[Figure: skew (ps, axis from 30 to -70) versus clock zone (1 to 15); series: no deskew, fuse-based deskew, scan-based deskew]

Worst case clock skew is 24ps in fuse mode and 7ps in scan mode

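The idea of trimming each clock zone with a small per-buffer code can be illustrated with a short sketch. Only the 3-bit setting per SLCB comes from the slides; the delay step, the add-delay-only behavior, the example offsets, and the function names below are all hypothetical, and this is not the authors' deskew procedure.

```python
def choose_codes(offsets_ps, step_ps, bits=3):
    """Pick one delay code per zone so arrival times line up as closely as the step allows.

    Assumption: a code can only add delay, so every zone is aligned to the latest one.
    """
    max_code = (1 << bits) - 1          # 3 bits -> codes 0..7
    target = max(offsets_ps)            # latest-arriving zone
    codes = []
    for offset in offsets_ps:
        code = round((target - offset) / step_ps)
        codes.append(min(max(code, 0), max_code))
    return codes


def skew_ps(offsets_ps, codes=None, step_ps=0):
    """Worst-case skew (max minus min arrival), optionally after applying codes."""
    if codes is None:
        codes = [0] * len(offsets_ps)
    adjusted = [o + c * step_ps for o, c in zip(offsets_ps, codes)]
    return max(adjusted) - min(adjusted)


if __name__ == "__main__":
    measured = [0, -12, 5, -30, -8]     # hypothetical zone offsets in ps
    step = 8                            # hypothetical delay per code step in ps
    codes = choose_codes(measured, step)
    print("codes:", codes)
    print("before:", skew_ps(measured), "ps")
    print("after:", skew_ps(measured, codes, step), "ps")
```

With these made-up inputs the residual skew is bounded by the step size, which is one way to read why a finer or re-programmable setting (scan mode) could reach a lower figure than a one-time fuse setting.
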
Power
- Same 130W power envelope as the 0.18µm Itanium 2 processor
- 50% frequency increase
- 2X larger L3 cache
- Leakage increased 3.5X
- Aggressive management of dynamic power
- Reduced clock loading
- Reduced contention power
- L3 cache power management

[Figure: power breakdown pie charts]
- This work: slices 5%, 14%, 7%, 74%; legend: dynamic power, I/O power, core leakage/static, cache leakage
- Itanium 2 Processor: slices 5%, 5%, 90%; legend: dynamic power, I/O power, all leakage/static

L3D Power Reduction Scheme

[Figure: L3 data array access, previous implementation vs. this work]
- Labels: Bank 0 Request, Bank 1 Request, Data in [7:0], Data out [7:0]
- This work adds a select on Index [4:3] (== 00, 01, 10, 11)
- Bus width annotations: 8 and 2

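The figure labels show Index [4:3] compared against 00, 01, 10 and 11. One plausible reading is a 2-bit field decoded into a one-hot select so that only one of four groups is activated per access. The sketch below illustrates only that bit-field decode; which circuits are actually gated cannot be recovered from the extracted labels, and the function names are hypothetical.

```python
def index_field(index: int) -> int:
    """Extract bits [4:3] of the cache index as a value from 0 to 3."""
    return (index >> 3) & 0b11


def group_enables(index: int) -> list:
    """One-hot enables: element g is True only when Index[4:3] == g."""
    field = index_field(index)
    return [field == group for group in range(4)]


if __name__ == "__main__":
    for index in (0b00000, 0b01010, 0b10111, 0b11000):
        print(f"index={index:05b} field={index_field(index):02b} enables={group_enables(index)}")
```

Exactly one enable is asserted for any index, so under this reading three of the four groups stay idle on each access.
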
Reliability Features

[Figure: block diagram with protection legend (ECC protected, Parity protected)]
- Core Pipeline
- L1I (16KB), L1D (16KB)
- L2 (256KB): Data, Tag
- L3 (6MB): Data + Tag
- ITLB1 (32), DTLB1 (32), IPC (128), DTLB2 (128), HW Page Walker
- Bus Unit: Bus Queues, Error Log, Timer, Data Poisoning
- System bus: Data Bus, Addr Bus

DFT/DFM Feature Summary

Feature                         Itanium Processor            Itanium 2 Processor           This work
Scan Coverage                   48K                          140K                          140K
Scanout Coverage                5.5K                         24K                           24K
Cache DAT Mode (major arrays)   Yes                          Yes                           Yes
L3 Redundancy / Repair          N/A                          Dual                          Quad
Weak-Write Test Mode            Fixed                        Fixed                         Programmable
IO DFT                          Basic IO Loopback            Limited IO Loopback           Enhanced IO Loopback
Dynamic Frequency Adjustment    Multi-cycle shrink/stretch   Single cycle shrink/stretch   Multi-cycle shrink/stretch
On-die process monitors         No                           No                            Yes

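Thermal Protection Features

[Figure: temperature versus time with thresholds, plus sensing circuit]
- Thresholds: Thermal Alert, ETM, Thermal Trip
- Events: ETM Throttle, Thermal Trip
- Circuit labels: I res, I diode, Diode, Reference
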
Frequency Shmoo

[Figure: PASS/FAIL shmoo plot; axes: Bus Period (ns) and Core Frequency (1.40GHz, 1.45GHz, 1.50GHz, 1.55GHz, 1.61GHz, 1.66GHz, 1.73GHz, 1.80GHz); label: 1:9]

Frequency increased by 50% from previous generation

Summary
- The 2003 version of the Itanium 2 processor (Madison) delivers 2X larger on-die cache and 50% increase in frequency
- Compatible with today's Itanium 2-based systems
- Enterprise-class RAS, DFT and DFM features
- Largest on-die cache and transistor count ever reported for a microprocessor
- 6.4GB/s multi-drop bus interface
- Clock de-skew technique achieves 24ps skew in fuse-mode and 7ps in scan mode across entire die