A 1.5GHz Third Generation Itanium Processor

Jason Stinson, Stefan Rusu
Intel Corporation, Santa Clara, CA

Outline
- Processor highlights
- Process technology details
- Itanium processor evolution
- Block diagram
- Cache circuit design details
- Package details
- Front-side bus interface
- Clock distribution
- Power dissipation
- RAS, DFT and DFM features
- Frequency shmoo
- Summary

130nm Itanium 2 Processor Highlights
- 410M transistors
- 374mm² die size
- 6MB on-die L3 cache
- 1.5GHz at 1.3V
- 6.4GB/s, 400MT/s 4-way bus interface
- Plug-in compatible with existing platforms
- Extensive RAS, DFT and DFM features
- Largest microprocessor transistor count and on-die cache

130nm Process Characteristics

Attribute      Value
Lgate          60nm
M1 pitch       350nm
M2 pitch       448nm
M3 pitch       448nm
M4 pitch       756nm
M5 pitch       1120nm
M6 pitch       1204nm
Dielectric     FSG, K=3.6
Memory cell    2.45µm²

S. Thompson et al., IEDM 2001 (top)
S. Tyagi et al., IEDM 2000 (bottom)
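The memory cell area above can be combined with the 6MB L3 size, 374mm² die and 410M device count quoted in the highlights for a rough sanity check. The sketch below is only a back-of-the-envelope estimate: it assumes the L3 data array uses the listed 2.45µm² cell, assumes a conventional six-transistor SRAM cell, and ignores tag, ECC and redundancy bits as well as decoders and sense amplifiers. None of those assumptions come from the slides.

```python
# Rough L3 data-array estimate from figures quoted in the slides.
L3_BYTES = 6 * 1024 * 1024      # 6MB on-die L3 (slides)
CELL_AREA_UM2 = 2.45            # memory cell area in square microns (slides)
DIE_AREA_MM2 = 374              # die size (slides)
TOTAL_TRANSISTORS = 410e6       # device count (slides)
TRANSISTORS_PER_CELL = 6        # ASSUMPTION: conventional 6T SRAM cell

data_bits = L3_BYTES * 8
# 1 mm^2 = 1e6 um^2
cell_area_mm2 = data_bits * CELL_AREA_UM2 / 1e6
cell_transistors = data_bits * TRANSISTORS_PER_CELL

print(f"L3 data bits:         {data_bits:,}")
print(f"Bit-cell area:        {cell_area_mm2:.1f} mm^2 "
      f"({cell_area_mm2 / DIE_AREA_MM2:.0%} of die)")
print(f"Bit-cell transistors: {cell_transistors / 1e6:.0f}M "
      f"({cell_transistors / TOTAL_TRANSISTORS:.0%} of total)")
```

Under these assumptions the raw data bits alone account for roughly a third of the die area and about three quarters of the transistor count, which is in line with the slide's claim of a record on-die cache.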

Itanium Processor Evolution

Attribute         Itanium Processor   Itanium 2 Processor   This work
Architecture      Explicitly Parallel Instruction Computing (all three)
Process           180nm               180nm                 130nm
Device Count      25M                 221M                  410M
On-die L3 cache   0*                  3MB                   6MB
Frequency         800MHz              1.0GHz                1.5GHz
Supply Voltage    1.6V                1.5V                  1.3V
Power             130W*               130W                  130W

* Includes 4MB cache on cartridge

Block Diagram
[Figure: processor block diagram. Recoverable block labels:]
- L3 Cache
- L2 Cache, Quad Port
- ECC (marked on several blocks)
- Branch Prediction
- L1 Instruction Cache and Fetch/Pre-fetch Engine, ITLB
- Instruction Queue (8 bundles)
- 11 Issue Ports: B B B M M M M I I F F
- Register Stack Engine / Re-Mapping
- Scoreboard, Predicate, NATs, Exceptions
- Branch & Predicate Registers, Branch Units
- IA-32 Decode and Control
- 128 Integer Registers, Integer and MM Units
- 128 FP Registers, Floating Point Units
- Quad-Port L1 Data Cache and DTLB, ALAT
- Bus (128b data, 6.4GB/s @400MT/s)

Cache Summary

Attribute      L1I           L1D               L2                L3
Size           16K           16K               256K              6M
Line Size      64B           64B               128B              128B
Ways           4             4                 8                 24
Replacement    LRU           NRU               NRU               NRU
Latency        I-Fetch: 1    INT: 1, FP: NA    INT: 5, FP: 6     14
Write Policy   -             WT (RA)           WB (WA)           WB (WA)
Bandwidth      R: 48GB/s     R: 24GB/s         R: 48GB/s         R: 48GB/s
                             W: 24GB/s         W: 48GB/s         W: 48GB/s

- Cache bandwidth increased by 50%
- L3 cache latency increased to 14 clocks and set associativity doubled
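The cache summary figures imply a set count and a per-clock transfer width for each level. The sketch below works these out, assuming K and M are binary multiples, GB/s means 10^9 bytes per second, and the arrays run at the 1.5GHz core clock. The function and variable names are illustrative only.

```python
# (size in bytes, line size in bytes, ways, read bandwidth in GB/s)
CACHES = {
    "L1I": (16 * 1024, 64, 4, 48),
    "L1D": (16 * 1024, 64, 4, 24),
    "L2": (256 * 1024, 128, 8, 48),
    "L3": (6 * 1024 * 1024, 128, 24, 48),
}
CORE_GHZ = 1.5  # core frequency from the highlights slide


def num_sets(size, line, ways):
    """Sets in a set-associative cache: size / (line size * ways)."""
    return size // (line * ways)


def bytes_per_clock(gb_per_s, ghz):
    """Bandwidth expressed as bytes moved per core clock."""
    return gb_per_s / ghz


for name, (size, line, ways, read_bw) in CACHES.items():
    print(f"{name}: {num_sets(size, line, ways)} sets, "
          f"{bytes_per_clock(read_bw, CORE_GHZ):.0f} B/clock read")

# "Set associativity doubled" implies 12 ways in the 3MB predecessor;
# doubling both capacity and ways leaves the set count unchanged.
previous_l3_sets = num_sets(3 * 1024 * 1024, 128, 12)
print("Previous L3 sets:", previous_l3_sets)
```

Because bytes per clock is unchanged when bandwidth and frequency scale together, the 50% bandwidth increase noted on the slide is consistent with the 1.0GHz to 1.5GHz frequency step alone.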

L1 Instruction Cache Circuit Detail
[Figure: circuit schematic and timing diagram. Signal labels: p0wl, p1wl, p0_local_bl, p1_local_bl, p0_global_bl, p1_global_bl, p0_write_data, p1_write_data, write_en#, fill_cycle_1, coreclock. Timing labels: p1_local_bl: write-1, read-2, read-3; p0_local_bl: read-1, write-1.]

L3 Cache
[Figure: L3 cache floorplan around the core. Labels: Way[11:0] and Way[23:12], each with BLOCK0 to BLOCK7; L3 subarray detail with decoders, sense amplifiers, timers and I/O mux; dimension labels 1588u and 776u.]
L3 cache contains 140 subarrays tiled to fit irregular shape of core

Package Details
[Figure: package photograph. Labels:]
- Power delivery connector
- Server Management Components
- Flip-Chip BGA package with Integrated Heat Spreader
- Interposer
- Substrate

Package Decoupling
[Figure: package decoupling layout. Labels:]
- Vdd (core)
- Vtt (FSB)
- PLL filter
- Power delivery connector

Front Side Bus
[Figure: bus topology with processors P1, P2, P3, P4 and CS.]

Attribute                Value
Interface Support        Glueless 4-way Multi-Processor
System Topology          Dual-sided board, staggered vias
Termination Voltage      1.2V, common ground with core
Voltage Reference        Ground-referenced, 0.75V Vref
Data Bus Width           128-bit
Data Bus Speed           400MT/s source synchronous
Data Strobes             1 differential strobe for 16b of data
Peak BW                  6.4GB/s
Address, Control Speed   200MHz common clock
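The peak bandwidth in the table follows directly from the data bus width and transfer rate. A short worked check, using only values from the table (the variable names are illustrative):

```python
DATA_WIDTH_BITS = 128        # data bus width
TRANSFER_RATE_MT_S = 400     # source-synchronous data rate
COMMON_CLOCK_MHZ = 200       # address/control common clock
BITS_PER_STROBE = 16         # one differential strobe per 16 data bits

# 16 bytes per transfer * 400 million transfers per second
peak_gb_s = DATA_WIDTH_BITS / 8 * TRANSFER_RATE_MT_S / 1000

# Data transfers per common-clock cycle
transfers_per_clock = TRANSFER_RATE_MT_S / COMMON_CLOCK_MHZ

# Differential strobes needed to cover the full data bus
strobes = DATA_WIDTH_BITS // BITS_PER_STROBE

print(f"Peak bandwidth: {peak_gb_s} GB/s")
print(f"Transfers per common clock: {transfers_per_clock:.0f}")
print(f"Differential data strobes: {strobes}")
```

The data bus therefore moves two transfers per 200MHz common-clock cycle, and the 128-bit bus needs eight differential strobes at one strobe per 16 bits.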

Front-Side Bus Topology
- Previous implementation: four linear stripes
- This work: U-shape
[Figure: die floorplans for both implementations. Labels: Core, L3 Cache, Data I/O, Address I/O, Control I/O.]

Itanium Processor Clock Distribution Trends

Attribute            Itanium Processor   Itanium 2 Processor   This work
Process/Metal        180nm / Al          180nm / Al            130nm / Cu
Primary tree         Single ended        Differential          Differential
Local distribution   Grid                Tree                  Tree
Clock skew           28ps                62ps                  24ps
Deskew Method        Active              On-demand             Fuse-based

Fuse-Based Clock Deskew
[Figure: deskew block diagram. Labels: ITC (TAP), SLCB scan chain, Ck Gen Unit, PD, SLCBs, SLCBO zones, FUSE Unit, FUSE settings (3-bit/SLCB).]
SLCB = Second Level Clock Buffer
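The slide states only that each SLCB carries a 3-bit setting. To illustrate how a 3-bit code per buffer could be chosen, here is a minimal sketch under assumptions that are not in the slides: each code step adds a fixed delay (`step_ps` is a hypothetical value), delay can only be added, and the per-zone skew numbers are made up for illustration.

```python
def choose_fuse_codes(zone_skew_ps, step_ps, bits=3):
    """Pick a delay code per zone so every zone approaches the latest one.

    zone_skew_ps: measured clock arrival offset per zone (later = larger).
    step_ps: delay added per code step (ASSUMED uniform).
    """
    max_code = (1 << bits) - 1           # 3 bits give codes 0..7
    target = max(zone_skew_ps)           # align to the latest zone
    codes = []
    for skew in zone_skew_ps:
        code = round((target - skew) / step_ps)
        codes.append(min(max(code, 0), max_code))  # clamp to code range
    return codes


def residual_spread(zone_skew_ps, codes, step_ps):
    """Skew spread (max minus min) after applying the chosen codes."""
    adjusted = [s + c * step_ps for s, c in zip(zone_skew_ps, codes)]
    return max(adjusted) - min(adjusted)


# Hypothetical measurements and step size, for illustration only.
skews = [-50, -24, 0, 16]
codes = choose_fuse_codes(skews, step_ps=8)
print("codes:", codes)
print("spread before:", max(skews) - min(skews), "ps")
print("spread after: ", residual_spread(skews, codes, step_ps=8), "ps")
```

In this toy model the residual skew is bounded by the step size and by the code range: the earliest zone here saturates at code 7 and cannot be fully aligned.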

Clock Zone Skew Plot
[Figure: skew (ps), axis from 30 to -70, versus clock zone (1 to 15), for three cases: no deskew, fuse-based deskew, scan-based deskew.]
Worst case clock skew is 24ps in fuse mode and 7ps in scan mode

Power
- Same 130W power envelope as the 0.18µm Itanium 2 processor
  - 50% frequency increase
  - 2X larger L3 cache
  - Leakage increased 3.5X
- Aggressive management of dynamic power
  - Reduced clock loading
  - Reduced contention power
  - L3 cache power management
[Figure: pie chart, This work. Slice values as extracted: 5%, 14%, 7%, 74%. Legend: dynamic power, I/O power, core leakage/static, cache leakage.]
[Figure: pie chart, Itanium 2 Processor. Slice values as extracted: 5%, 5%, 90%. Legend: dynamic power, I/O power, all leakage/static.]
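A first-order view of why dynamic power needed active management: with the standard model P_dyn proportional to C·V²·f, the supply and frequency values quoted in the deck (1.5V at 1.0GHz for the previous part, 1.3V at 1.5GHz here) give the scaling below. This is a textbook approximation, not the authors' analysis, and it assumes switched capacitance and activity are unchanged (`c_ratio = 1.0`).

```python
def dynamic_power_ratio(v_old, v_new, f_old, f_new, c_ratio=1.0):
    """New/old dynamic power under P ~ C * V^2 * f.

    c_ratio is the new/old switched-capacitance ratio (ASSUMED 1.0).
    """
    return c_ratio * (v_new / v_old) ** 2 * (f_new / f_old)


ratio = dynamic_power_ratio(v_old=1.5, v_new=1.3, f_old=1.0, f_new=1.5)
print(f"Dynamic power ratio at equal capacitance: {ratio:.2f}")

# Capacitance (or activity) reduction needed to hold dynamic power flat
break_even_c = 1 / ratio
print(f"Break-even switched-capacitance ratio:    {break_even_c:.2f}")
```

Under this model the lower supply voltage offsets most, but not all, of the 50% frequency increase, so holding the 130W envelope while leakage grows also requires cutting switched capacitance or activity, which matches the slide's emphasis on reduced clock loading and cache power management.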

L3D Power Reduction Scheme
[Figure: L3 data array access, previous implementation versus this work. Previous implementation: Bank 1 Request and Bank 0 Request, Data in [7:0], Data out [7:0]. This work: Bank 1 Request and Bank 0 Request qualified by Index [4:3] (== 11, == 10, == 01, == 00), Data in [7:0], Data out [7:0]. Width annotations of 8 and 2 appear on the data paths.]
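The figure labels suggest that index bits [4:3] select one of four cases within a bank request. The circuit itself is not recoverable from the transcript, so the following only illustrates what a two-bit index decode into one-hot enables looks like; the function name and the idea that exactly one of four segments is enabled per access are assumptions.

```python
def segment_enables(index):
    """Decode index bits [4:3] into four one-hot enables.

    Returns a list where entry i is True when Index[4:3] == i.
    """
    select = (index >> 3) & 0b11     # isolate bits 4 and 3
    return [select == i for i in range(4)]


for index in (0b00000, 0b01000, 0b10000, 0b11000):
    print(f"Index[4:3] = {(index >> 3) & 0b11:02b} ->", segment_enables(index))
```

With a decode like this, only one segment in four would be activated per request, which is one common way to cut array access power.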

Reliability Features
[Figure: protection map of on-die structures. Legend: ECC protected, Parity protected. Block labels:]
- Core Pipeline
- L1I (16KB), L1D (16KB)
- L2 (256KB) Data, Tag
- L3 (6MB) Data + Tag
- ITLB1 (32), DTLB1 (32), IPC (128), DTLB2 (128)
- HW Page Walker
- Bus Unit, Bus Queues, Error Log, Timer
- Data Poisoning
- Data Bus, Addr Bus, System bus

DFT/DFM Feature Summary

Feature                          Itanium Processor            Itanium 2 Processor           This work
Scan Coverage                    48K                          140K                          140K
Scanout Coverage                 5.5K                         24K                           24K
Cache DAT Mode (major arrays)    Yes                          Yes                           Yes
L3 Redundancy / Repair           N/A                          Dual                          Quad
Weak-Write Test Mode             Fixed                        Fixed                         Programmable
IO DFT                           Basic IO Loopback            Limited IO Loopback           Enhanced IO Loopback
Dynamic Frequency Adjustment     Multi-cycle shrink/stretch   Single cycle shrink/stretch   Multi-cycle shrink/stretch
On-die process monitors          No                           No                            Yes

Thermal Protection Features
[Figure: temperature versus time plot with levels labeled Thermal Alert, ETM and Thermal Trip, and response signals ETM Throttle and Thermal Trip; sensing circuit labels: I res, I diode, Diode, Reference.]

Frequency Shmoo
[Figure: pass/fail shmoo plot. Labels: PASS, FAIL, 1:9, Bus Period (ns), Core Frequency. Core frequency marks: 1.80GHz, 1.73GHz, 1.66GHz, 1.61GHz, 1.55GHz, 1.50GHz, 1.45GHz, 1.40GHz.]
Frequency increased by 50% from previous generation

Summary
- The 2003 version of the Itanium 2 processor (Madison) delivers 2X larger on-die cache and 50% increase in frequency
- Compatible with today's Itanium 2-based systems
- Enterprise-class RAS, DFT and DFM features
- Largest on-die cache and transistor count ever reported for a microprocessor
- 6.4GB/s multi-drop bus interface
- Clock de-skew technique achieves 24ps skew in fuse mode and 7ps in scan mode across entire die