Design Objectives of the 0.35µm Alpha 21164 Microprocessor (A 500MHz Quad Issue RISC Microprocessor)

Design Objectives of the 0.35µm Alpha 21164 Microprocessor
(A 500MHz Quad Issue RISC Microprocessor)

Gregg Bouchard
Digital Semiconductor, Digital Equipment Corporation
Hudson, MA

Outline
- 0.35µm Alpha 21164 Overview
- Design Goals
- Technology Issues
- 21164 Internal Architecture Review
- Architectural Enhancements
- System Performance Enhancements
- Alpha Processor RoadMap
- Summary

0.35µm ALPHA 21164

Technology shrink of 0.5µm design + Architectural Enhancements + System Performance Enhancements

                        0.5 µm process       0.35 µm process
Transistor Count        9.3 Million          9.66 Million
Die Size                16.5 mm x 18.1 mm    14.4 mm x 14.5 mm
Power Supply            3.3V external        3.3V external
                        3.3V internal        2.5V internal
WC Power Dissipation    50W @ 300 MHz        37W @ 433 MHz
Target Cycle Time       300 MHz              433 MHz

Design Goals
- Reduced Cost
    - Die size reduction
    - Remove processing steps
- Higher Performance
    - 433 MHz design target
    - Architectural enhancements
- Reduced Power
    - Lower Core Operating Voltage (2.0V-2.5V)
- Time to Market
    - Significant leverage from previous design

Die Size Analysis

Step          X change (mm)  Y change (mm)  X-dim (mm)  Y-dim (mm)  Normalized area
Original                                    16.5        18.1        1.00
25% shrink    -4.1           -4.5           12.4        13.6        0.56
Conversion    +0.0           +0.8           12.6        14.5        0.62
LI Removal    +1.8           +0.0           14.4        14.5        0.70
Actual                                      14.4        14.5        0.70
Ideal 30%                                   11.5        12.7        0.49

Original Die        16.5mm x 18.1mm    298 mm²
Actual shrink       14.4mm x 14.5mm    209 mm²
Ideal 30% shrink    11.5mm x 12.7mm    146 mm²
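The area figures in the Die Size Analysis follow from simple geometry, and a short Python sketch can reproduce them. The dimensions come from the slide; the function names are illustrative, and the shrink is modeled as a uniform linear scaling of both dimensions, which is a simplification.

```python
# Die dimensions in mm, taken from the Die Size Analysis slide.
ORIGINAL = (16.5, 18.1)
ACTUAL = (14.4, 14.5)


def area(dims):
    """Die area in mm^2 for an (x, y) pair given in mm."""
    x, y = dims
    return x * y


def linear_shrink(dims, fraction):
    """Scale both dimensions by (1 - fraction), e.g. 0.30 for a 30% shrink."""
    scale = 1.0 - fraction
    return (dims[0] * scale, dims[1] * scale)


def normalized_area(dims, reference=ORIGINAL):
    """Area relative to the original 0.5 µm die."""
    return area(dims) / area(reference)


shrink_25 = linear_shrink(ORIGINAL, 0.25)
ideal_30 = linear_shrink(ORIGINAL, 0.30)

for name, dims in [("25% shrink", shrink_25), ("actual", ACTUAL), ("ideal 30%", ideal_30)]:
    print(f"{name}: {area(dims):.1f} mm^2, normalized {normalized_area(dims):.2f}")
```

Because area scales with the square of the linear factor, a 30% linear shrink would roughly halve the die, while the actual die retains about 70% of the original area.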

Layout Conversion Strategy
- Full 30% linear shrink was not possible
- Solution:
    - 25% linear shrink
    - Semi-automated conversion of design rules
    - Polysilicon mask layer pushed to full 30% shrink dimensions for performance
    - Redesign of caches
    - Local Interconnect removed

Layout Conversion Example
[Figure: one layout cell shown at three stages: Original, After Script, Final. Legend: Well, Diffusion, Well Plug, Poly, M1C, Metal 1, M2C, Metal 2, DRC Errors.]

Speed
- Single-wire, two phase clocking scheme
- 14 gates per cycle including latches
- Single global clock grid
    - Global clock skew < 90ps
    - Local clock skew < 25ps
- Clock statistics (0.5 µm design)
    - Clock load = 3.75 nF
    - Size of final clock inverter = 58 cm
    - Edge rate = 0.5 ns
    - Clocking consumes 40% of chip power
    - di/dt = 50 A
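To put the "14 gates per cycle" figure in context, the sketch below converts a clock frequency into a cycle time and an average per-gate delay budget. The constants come from the slides; applying the same gate depth and global skew figure at both 300 MHz and 433 MHz is an assumption made for illustration.

```python
GATES_PER_CYCLE = 14      # from the slide, including latches
GLOBAL_SKEW_PS = 90.0     # upper bound on global clock skew from the slide


def cycle_time_ps(freq_mhz):
    """Clock period in picoseconds for a frequency in MHz."""
    return 1e6 / freq_mhz


def gate_budget_ps(freq_mhz):
    """Average delay available per gate, ignoring skew and setup margins."""
    return cycle_time_ps(freq_mhz) / GATES_PER_CYCLE


def skew_fraction(freq_mhz, skew_ps=GLOBAL_SKEW_PS):
    """Share of the cycle consumed by the worst-case global skew."""
    return skew_ps / cycle_time_ps(freq_mhz)


for f in (300, 433):
    print(f"{f} MHz: cycle {cycle_time_ps(f):.0f} ps, "
          f"{gate_budget_ps(f):.0f} ps per gate, "
          f"skew {100 * skew_fraction(f):.1f}% of cycle")
```

At 433 MHz this gives a cycle of about 2.3 ns and roughly 165 ps per gate, which shows why a skew bound under 90 ps matters at this clock rate.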

Speed Verification
- Test circuits were used to determine the speed scaling of different circuit configurations
    - Predicted average process speed up
    - Identified slow circuit types
- New circuits were evaluated in SPICE
- Chip sections with major modifications were completely re-verified

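Power
- Significant Power Reduction
    - Vddi = 2.2V (to 2.5V)
- 3.3V only interface to external devices
[Chart: power in watts (0 to 50) versus clock frequency (300, 366, 433, 500 MHz) for the 0.5µm design at 3.3V and the 0.35µm design at 2.5V, broken down into I/O, Clock, and Core.]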
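A first-order way to see where the power reduction comes from is the usual CMOS dynamic power relation, P proportional to C·V²·f. The sketch below scales the 0.5µm figure (50 W at 3.3 V and 300 MHz, from the slides) to the new voltage and frequency. It assumes unchanged switched capacitance and treats the whole chip as running at the core voltage, neither of which holds exactly (the I/O stays at 3.3 V and the shrink changes capacitance), so the result is only a rough estimate.

```python
BASE_POWER_W = 50.0    # worst-case power of the 0.5 µm part (from the slides)
BASE_VOLTAGE = 3.3     # internal supply of the 0.5 µm part
BASE_FREQ_MHZ = 300.0


def scaled_power(voltage, freq_mhz):
    """Scale dynamic power as V^2 * f, holding capacitance constant (assumption)."""
    v_ratio = voltage / BASE_VOLTAGE
    f_ratio = freq_mhz / BASE_FREQ_MHZ
    return BASE_POWER_W * v_ratio ** 2 * f_ratio


for vdd in (2.5, 2.2):
    print(f"Vdd = {vdd} V at 433 MHz: about {scaled_power(vdd, 433):.1f} W")
```

The estimates for 2.5 V and 2.2 V bracket the 37 W at 433 MHz quoted in the overview table, which suggests the voltage reduction accounts for most of the savings.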

[Figure: chip floorplan showing the Pre-CLK Drivers, Main CLK Drivers, L2 Cache (two regions), L2 Tags, L1 I Cache, L1 D Cache, I-Box, E-Box, F-Box, M-Box, C-Box, and Pad Ring (I/O).]

Alpha 21164 Features

Key Attributes
- 4-way issue superscalar
    - Up to 2 Integer AND 2 Floating Point instructions issued per CPU cycle
- Large on-chip L2 cache
    - 96KB, writeback, 3-way set associative
- Fully Pipelined
    - 7-stage integer pipeline
    - 9-stage floating point pipeline
- Emphasis on low latency at high clock rate
- High-throughput memory subsystem

Alpha 21164 Block Diagram
[Figure: block diagram in three sections.
 Instruction Unit: Instruction Cache (8KB), Four-way Issue Unit.
 Execution Units: Integer Unit, Integer Unit, FP Adder, FP Mult.
 Memory Unit: Merge Logic, Write-Through Data Cache (8KB), Writeback L2 Cache (96KB), Bus Interface Unit connecting to the L3 Cache (40b Address, 128b Data).
 Also labeled: 128-bit internal data bus; ITB: 48 entry; DTB: 64 entry.]

Instruction Issue Pipeline Review
[Figure: stages S0 to S3. Blocks shown: Next PC, 2Kx2 branch prediction, Instruction Cache (8KB), Prefetch Buffer, Instruction Buffer, Instruction Slot, Issue Conflict Checker, Instruction Issue. Outputs go to the floating point multiply pipeline, the floating point add pipeline, integer pipeline 0, and integer pipeline 1.]
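The "2Kx2 branch prediction" label describes 2K entries of 2 bits each. A common way to use 2 bits per entry is a saturating counter, sketched below. The slide gives only the table dimensions, so the counter encoding, the initial state, and indexing by instruction address are assumptions for illustration, not details confirmed by the source.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters (0-1 predict not taken, 2-3 predict taken)."""

    def __init__(self, entries=2048):
        self.entries = entries
        # Start every counter at "weakly not taken" (assumption).
        self.counters = [1] * entries

    def _index(self, pc):
        # Instructions are 4 bytes, so drop the low 2 bits (indexing is an assumption).
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Return True if the branch at pc is predicted taken."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Move the counter one step toward the actual outcome, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


predictor = TwoBitPredictor()
for outcome in (True, True, False, True):
    print(predictor.predict(0x1000), outcome)
    predictor.update(0x1000, outcome)
```

With this encoding a single mispredicted iteration, such as a loop exit, does not flip a strongly biased prediction, which is the usual motivation for 2 bits rather than 1.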

Execution Pipeline Review
[Figure: stages S3 to S8.]
- Int Regs
    - Integer Pipeline 0: arith, logical, ld/st, shift; Int mul
    - Integer Pipeline 1: arith, logical, ld, br/jmp
- FP Regs
    - FP Pipeline 0: add, subtract, compare, FP br; FP div
    - FP Pipeline 1: multiply
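The pipeline assignments above imply structural limits on which instructions can issue together. The sketch below models only that constraint: it finds the longest in-order prefix of up to four instructions that can each be given a distinct pipeline. The short pipe names (E0, E1, FA, FM) and the operation class names are shorthand introduced here, placing integer multiply on pipeline 0 and FP divide on FP pipeline 0 follows the slide layout, and data hazards and the real slotting rules are ignored.

```python
from itertools import product

# Which pipelines can execute each operation class (from the slide; names are shorthand).
ELIGIBLE = {
    "arith": ("E0", "E1"), "logical": ("E0", "E1"), "ld": ("E0", "E1"),
    "st": ("E0",), "shift": ("E0",), "imul": ("E0",),
    "br": ("E1",), "jmp": ("E1",),
    "fadd": ("FA",), "fsub": ("FA",), "fcmp": ("FA",), "fbr": ("FA",), "fdiv": ("FA",),
    "fmul": ("FM",),
}


def issue_group(instrs, width=4):
    """Return (op, pipe) pairs for the longest in-order prefix that fits this cycle."""
    window = instrs[:width]
    for n in range(len(window), 0, -1):
        prefix = window[:n]
        # Try every combination of eligible pipes; accept one with no pipe used twice.
        for pipes in product(*(ELIGIBLE[op] for op in prefix)):
            if len(set(pipes)) == n:
                return list(zip(prefix, pipes))
    return []


print(issue_group(["arith", "st", "fadd", "fmul"]))   # all four can issue
print(issue_group(["arith", "logical", "ld"]))        # only two integer pipes
```

A group of three integer operations issues only its first two, which illustrates why the four-way peak requires a mix of two integer and two floating point instructions.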

On-chip Cache Resources Review
[Figure: stages S6 to S9 around the 96K, 3 Set, Writeback L2 Cache. Labels: 2 System Commands, 6 Data Reads, 6 Data Writes, 3 Instr. Reads, 128 bits/cycle, System data, Miss Addr 0, Miss Addr 1, Victim Addr 0, Victim Addr 1, Victim 0 data, Victim 1 data, 40 bit PA.]

L3 Cache (off-chip)
[Figure: the 21164 Bus Address File drives Command/Addr and Index to the L3 Tag and L3 Data arrays; 128-bit data bus.]
- L3 cache is a direct-mapped writeback superset of on-chip L2 cache
- Up to 2 reads (or outstanding read commands) in L3 cache
- Programmable wave pipelining for L3 cache
- Support for Synchronous Flow-Thru SRAMs
- L3 cache is optional

Off-Chip L3 Cache Options

Selectable via on-chip programmable registers
- Cache Size: 1 to 64M Byte
- Cache Read/Write Speed: 4 to 15 cpu cycles
- Read to Write Spacing: 1 to 7 cpu cycles
- Write Pulse (Bit Mask): Up to 9 cpu cycles
- Wave pipelining: 0 to 3 cpu cycles
- Support for Synchronous SRAMs
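Because the L3 cache is direct-mapped and its size is programmable, the split of the 40-bit physical address into tag, index, and block offset changes with the configured size. The sketch below shows that split. The 64-byte block size is an assumption made for illustration (the slides do not state it), and the function names are hypothetical.

```python
PA_BITS = 40        # physical address width from the slides
BLOCK_BYTES = 64    # assumed block size; not given in the slides


def l3_fields(cache_bytes, block_bytes=BLOCK_BYTES, pa_bits=PA_BITS):
    """Return (offset_bits, index_bits, tag_bits) for a direct-mapped cache."""
    offset_bits = block_bytes.bit_length() - 1
    index_bits = (cache_bytes // block_bytes).bit_length() - 1
    tag_bits = pa_bits - index_bits - offset_bits
    return offset_bits, index_bits, tag_bits


def split_address(pa, cache_bytes, block_bytes=BLOCK_BYTES):
    """Split a physical address into (tag, index, offset)."""
    offset_bits, index_bits, _ = l3_fields(cache_bytes, block_bytes)
    offset = pa & (block_bytes - 1)
    index = (pa >> offset_bits) & ((1 << index_bits) - 1)
    tag = pa >> (offset_bits + index_bits)
    return tag, index, offset


for size_mb in (1, 64):
    print(size_mb, "MB:", l3_fields(size_mb << 20))
```

Growing the cache from 1 MB to 64 MB moves six bits from the tag to the index, so the largest configuration needs the narrowest tag.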

Architectural Enhancements

New Instructions
- Scalar support for Byte and Short data types
    - LDBU, LDWU: load an unsigned byte or short
    - STB, STW: store a byte or short
    - SEXTB, SEXTW: sign extend a byte or short
- Eases porting of device drivers to Alpha
- Improves emulation of Intel code on Alpha
- Implemented in this and all future Alpha microprocessors
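The effect of the new byte and short instructions can be modeled in a few lines of Python. The sketch below contrasts a direct unsigned byte load with the older approach of loading the enclosing aligned quadword and extracting the byte, and shows what sign extension to 64 bits does. The Python function names are hypothetical stand-ins for the instructions, and memory is modeled as a little-endian byte string.

```python
MASK64 = (1 << 64) - 1


def sextb(value):
    """Sign extend the low byte of value to 64 bits (models SEXTB)."""
    b = value & 0xFF
    return (b - 0x100 if b & 0x80 else b) & MASK64


def sextw(value):
    """Sign extend the low 16 bits of value to 64 bits (models SEXTW)."""
    w = value & 0xFFFF
    return (w - 0x10000 if w & 0x8000 else w) & MASK64


def ldbu(memory, addr):
    """Load one unsigned byte directly (models LDBU)."""
    return memory[addr]


def ldbu_emulated(memory, addr):
    """Same result without a byte load: fetch the aligned quadword, then extract."""
    base = addr & ~7
    quad = int.from_bytes(memory[base:base + 8], "little")
    return (quad >> (8 * (addr & 7))) & 0xFF


mem = bytes(range(0x80, 0x90))
print(hex(ldbu(mem, 11)), hex(ldbu_emulated(mem, 11)), hex(sextb(ldbu(mem, 11))))
```

The emulated path needs address masking, a wider load, and a shift for every byte access, which is the overhead the new instructions remove for device drivers and emulated Intel code.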

Architectural Enhancements

New Instructions (continued)
- IMPLVER
    - returns a small integer indicating the core design
    - used for code scheduling decisions
    - implemented in all Alpha microprocessors
- AMASK
    - clears bits to indicate which features are present
    - implemented in all Alpha microprocessors

Example:
    AMASK #1, R0         ; byte/word present
    BNE   R0, emulate    ; if not, emulate
    ...
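The AMASK example relies on an inverted convention: a bit that comes back cleared means the feature is present. The Python sketch below models that behavior. The constant name BWX and the function names are labels introduced here, and the bit position follows the "#1" operand in the slide's example.

```python
BWX = 1 << 0   # byte/word feature bit, matching "AMASK #1" in the example


def amask(requested, implemented):
    """Clear each requested bit whose feature the processor implements."""
    return requested & ~implemented


def needs_emulation(implemented):
    """Mirror of the slide's BNE test: a nonzero result means the feature is absent."""
    return amask(BWX, implemented) != 0


print(needs_emulation(implemented=0))     # processor without byte/word support
print(needs_emulation(implemented=BWX))   # processor with byte/word support
```

Code can therefore request several features in one mask and branch to a fallback path if any requested bit survives.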

System Performance Enhancements
1. oe_we_active_low (eliminates off-chip buffering); big_drv_en (50% more drive)
2. st_clk1,2 (additional pin)
3. clk_mode<2> (1x clock mode)
4. Bus utilization improvements (Auto Dack)
5. Cache Coherence
[Figure: the 21164 connected to the Bcache Tag Store and Data Store through index, tag_oe,we, data_oe,we, tag_data, and a 128-bit data path. Timing diagram with cmd, addr, data, and dack rows: Write A0 with D00 D01 D02 D03, then Write A1 with D10 D11. Cache state labels: M, O, E, S, I.]

Pre-Silicon Logic Verification
- Used extensive test suite developed for original chip as testing baseline
- Random and focused testing
- Coverage analysis to ensure excellent test coverage
- Three simulation systems used:
    - RTL
    - Transistor-level
    - Gate-level

Alpha Microprocessor Road Map
[Chart: SpecInt95 (0 to 25) versus year (1992 to 1998). Parts plotted: 21064-150MHz, 21064-200MHz, 21064A-275MHz, 21164-300MHz, 21164-333MHz, 21164-366MHz, 21164-400MHz, 21164-433MHz, 21164-500MHz.]

Summary
- FIRST PASS SILICON - October 1995
- Booted first operating system in 2 days!
- Continued Performance Leadership (4+ years)
- Look for next generation 30+ SpecInt95 by Q3 97

Acknowledgments
Peter J. Bannon, Michael S. Bertone, Randel P. Blake Campos, William J. Bowhill, David A. Carlson, Ruben W. Castelino, Dale R. Donchin, Richard M. Fromm, Mary K. Gowan, Paul E. Gronowski, Anil K. Jain, Bruce J. Loughlin, Shekhar Mehta, Jeanne E. Meyer, Robert O. Mueller, Andy Olesin, Tung N. Pham, Ronald P. Preston, Paul I. Rubinfeld