A 5.2 GHz Microprocessor Chip for the IBM zEnterprise


A 5.2 GHz Microprocessor Chip for the IBM zEnterprise(TM) System

J. Warnock (1), Y. Chan (2), W. Huott (2), S. Carey (2), M. Fee (2), H. Wen (3), M.J. Saccamango (2), F. Malgioglio (2), P. Meaney (2), D. Plass (2), Y.-H. Chan (2), M. Mayo (2), G. Mayer (4), L. Sigal (5), D. Rude (2), R. Averill (2), M. Wood (2), T. Strach (4), H. Smith (2), B. Curran (2), E. Schwarz (2), L. Eisen (3), D. Malone (2), S. Weitzel (3), P.-K. Mak (2), T. McPherson (2), C. Webb (2)

IBM Systems and Technology Group: (1) Yorktown Heights, NY; (2) Poughkeepsie, NY; (3) Austin, TX; (4) Boeblingen, Germany; (5) IBM Research, Yorktown Heights, NY

(Presented by J. Warnock, ISSCC 2011)

Outline
- Introduction
- Technology and chip overview
- RAS
- Clock grids
- Core circuit design
- L3 design
- Power and noise considerations
- Chip frequency tuning
- Conclusions

Introduction: zEnterprise 196 (z196)
- Single-thread performance is critical for System z applications
- Starting point for z196: the high-frequency z10 design
  - z10 cores run at 4.4 GHz in 65nm technology
- Move to 45nm technology for z196
  - Maintain the low-FO4 pipeline for maximum frequency boost
  - Add out-of-order execution: improved IPC
  - Design improvements to reduce power dissipation
  - Improvements to the 4-level cache hierarchy
- Result: achieved 5.2 GHz operating frequency
  - Same power envelope as the previous system
  - Up to 40% improvement seen on legacy workloads (see the quick check below)
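A quick back-of-the-envelope check on these numbers (illustrative arithmetic only, using the figures quoted on this slide and in the conclusions):

```python
# Back-of-the-envelope check (illustrative only).
z10_freq = 4.4   # GHz, 65nm z10
z196_freq = 5.2  # GHz, 45nm z196

freq_gain = z196_freq / z10_freq - 1
print(f"frequency gain: {freq_gain:.1%}")        # ~18.2%, matching the 18% boost in the conclusions

# If the total single-thread gain is up to 40%, the implied IPC
# contribution from out-of-order execution and the cache hierarchy:
total_gain = 0.40
ipc_gain = (1 + total_gain) / (1 + freq_gain) - 1
print(f"implied IPC gain: {ipc_gain:.1%}")       # ~18.5%
```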

Chip Technology & Design Overview
- High-performance 45nm SOI technology
  - Embedded DRAM; deep-trench decoupling capacitors available
  - Four logic threshold-voltage options (low-V_T option for ultra-critical paths)
  - Two extra high-performance wiring planes added
- 4 cores, 1.5MB L2 per core, 24MB shared L3
- Two co-processors, DDR3 RAIM memory controller, I/O bus controller (GX)
- 1.4B transistors, 512mm² chip area

[Chip floorplan: Cores 0-3, two shared co-processors, two eDRAM L3 cache regions with L3 data stacks, two L3 controllers, memory controller with main-memory I/Os, SC I/O interfaces, and GX logic with GX I/O interfaces]

Chip RAS Features
- Base technology: SOI provides an SER advantage
- Component-level hardening against SER
  - Stacked devices in most clocked storage elements
- Extensive circuit-level techniques used
  - Parity, residues (sketched below), local duplication of function
  - Checking overhead estimated at 20-25% for digital logic
- On-chip caches: ECC, parity
- Memory: RAIM ECC, bus CRC, tiered recovery
- Recovery Unit (RU)
  - Maintains checkpointed state
  - RU restarts the processor from the checkpointed state when an error is detected
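To illustrate the residue checking mentioned above, here is a minimal sketch of the general technique (a mod-3 residue code on an adder); the chip's actual checkers are hardware circuits whose details are not given here:

```python
# Minimal residue-check sketch (mod-3 residue code on an adder).
# Illustrates the general technique only, not the z196's circuits.
MOD = 3

def residue(x: int) -> int:
    return x % MOD

def checked_add(a: int, b: int) -> int:
    result = a + b                       # the datapath operation
    # Check: residues must stay congruent; a transient fault that
    # flips result bits is very likely to break this congruence.
    if residue(result) != (residue(a) + residue(b)) % MOD:
        raise RuntimeError("residue mismatch: recover from checkpoint")
    return result

print(checked_add(123456, 789012))  # 912468, check passes
```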

Chip Clock Grids
- High-frequency grids over each core: 1:1 clock period, 5.2 GHz
  - Grid extends over the L2 control region (those circuits run at a 2:1 gear ratio)
- Half-speed grid over the L2 & nest: 2:1 grid, 2.6 GHz
  - Most of the nest is geared down by a further factor of 2, i.e. 4:1
- Synchronous interfaces from core to L2: 1:1 -> 2:1, on separate grids
- Synchronous interfaces from L2 to L3: 2:1 -> 4:1, but on the same grid
- Separate asynchronous grids over the I/O region
  - Some overlap of synchronous and asynchronous grids
(The grid frequencies implied by these gear ratios are worked out below.)
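The grid frequencies follow directly from the gear ratios against the 5.2 GHz core clock; a minimal sketch of that arithmetic:

```python
# Clock-grid frequencies implied by the gear ratios on this slide.
core_freq_ghz = 5.2

for name, ratio in [("core (1:1)", 1), ("L2/nest (2:1)", 2), ("geared nest (4:1)", 4)]:
    freq = core_freq_ghz / ratio
    period_ps = 1e3 / freq
    print(f"{name}: {freq:.2f} GHz, period {period_ps:.0f} ps")
# core (1:1): 5.20 GHz, period 192 ps
# L2/nest (2:1): 2.60 GHz, period 385 ps
# geared nest (4:1): 1.30 GHz, period 769 ps
```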

[Four floorplan overlays building up the clock-grid map: 1:1 grids over the cores; the 1:1 grid extended over the L2 control regions (circuits at 2:1); 2:1 grids over the L3/nest region (circuits at 4:1 in parts of the nest); and asynchronous grids over the memory, SC, and GX I/O interfaces]

Processor Core Circuit Design
- Custom dataflow implementation
  - Static CMOS design with parameterized gates
  - Automated device-width and V_T tuning (pre- and post-layout)
  - Aggressive use of pulsed local clocks for power savings
  - Widespread fine-grained local clock gating
- Custom high-speed memory elements
  - 64KB I-cache, 128KB D-cache
  - Dynamic circuits for critical access paths
- Synthesized control logic
  - Structured placement of clocked storage elements
  - Custom-like techniques for robust pulsed-clock routing

Custom Design Methodology
[Flow diagram: for each design partition ("macro"), a custom schematic plus placement info and special wires produce a placed design; wire RC estimation supplies estimated R and C to a tuner for area, power, and timing optimization against the timing constraints and requirements; routing then produces the macro layout, and parasitic extraction feeds a final layout-aware tuner pass for power and timing optimization]

Single-cycle FXU Execution Loop (Previous Generation Design)
[Circuit schematic: dynamic MUX-latch implementation]

Single-cycle FXU Execution Loop Controls (Current Design)
[Circuit schematic: static logic plus a pulsed-clock latch; operand select and control driven by local clocks (lclk), L1/L2 latch stages (l1_q, l1_b, l2, l2_b) with data clocks (dclk) and latch clocks (l2clk), a scan_in/scan_out chain with scan clocks, and output (out_b) to the adder, rotator, etc.]

L1 D-cache Critical 4-cycle Access Loop
[Circuit diagram spanning cycles A0-A3: LSU pipes 0 and 1 perform AGEN and drive read/write addresses (ra0/ra1, wa) with compares against a 64-entry, 8W x 14b structure (SETP); the 128KB 8-way L1 D-cache (8 arrays of 512 x 8w x 36b) uses dual-phase read/write column circuits with local bit lines and local R/W circuits (rbl_t, wbl_t/wbl_c, di/do true and complement); low-V_T dynamic logic with late select and dynamic output muxes formats two 32B outputs (dout0, dout1) across the array and cycle boundaries]
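The array organization in the figure squares with the quoted 128KB capacity if each 36-bit word holds 32 data bits plus 4 check bits (an assumption for this arithmetic; the exact bit assignment is not stated in the transcript):

```python
# D-cache capacity check (assumes 32 data bits per 36-bit word;
# the remaining 4 bits are presumed parity/check bits).
arrays, rows, words, bits_per_word = 8, 512, 8, 36
total_bits = arrays * rows * words * bits_per_word      # 1,179,648
data_bits  = arrays * rows * words * 32                 # 1,048,576
print(total_bits / 8 / 1024, "KB total")                # 144.0 KB
print(data_bits / 8 / 1024, "KB data")                  # 128.0 KB
```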

L3 Design
- 24MB on-chip shared L3 cache
  - Built from high-performance eDRAM macros: 1.54ns access time for an individual DRAM macro
- 192MB 4th-level cache (L4) on a separate SC chip
- Equal L3 access time from any of the 4 cores: 45 processor-core cycles for an L3 hit (see the arithmetic below)
- Near/far data clocked on alternate 2:1 cycles
  - Allows sharing of circuitry between near and far data
- Lower-level caches are store-through
  - Drives significant store activity into the L3 => highly interleaved design
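Putting the quoted numbers together (simple arithmetic on the figures above, nothing new):

```python
# L3 hit latency implied by the numbers on this slide.
core_freq_ghz = 5.2
core_cycle_ns = 1 / core_freq_ghz                  # ~0.192 ns
print(f"L3 hit: {45 * core_cycle_ns:.2f} ns")      # ~8.65 ns for 45 core cycles

dram_macro_ns = 1.54
print(f"DRAM macro access: {dram_macro_ns / core_cycle_ns:.1f} core cycles")  # ~8.0
```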

L3 Timing Diagram
[Timing diagram across cycles C0-C7: L3 directory lookup, L3 cache access, ECC, and core return; near data is clocked in-phase (256x8x144 array access in C2/C3, 12 bits through a 2:1 mux in C4, fetch ECC in C4.5/C5, fetch in C5/C5.5), far data is clocked a half-cycle out of phase (array access in C2.5/C3.5, 12 bits in C4.5, fetch in C5.5/C6 and C6/C6.5); a near/far DW mux and time-sliced sharing of the ECC circuitry merge both paths onto the L3 data stack's fetch-return mux, with 64/72 ECC on the core return and a 0.77ns interval marked on the fetch path]

Chip Power Considerations
- Fixed chip power budget: roughly the same budget as the last-generation design, but with:
  - Higher frequency
  - Larger chip area
  - Higher capacitance density (from technology scaling)
- Net: the design team faced significant power issues
- Focused effort on power reduction
  - Keep about the same cycle time (in FO4) as the previous design, but improve power efficiency to enable higher frequency
- Net: power efficiency improved by ~25%
  - Translation: ~8-10% improved frequency at constant power (one way to connect these numbers is sketched below)
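One plausible way to connect the ~25% efficiency gain to the quoted 8-10% frequency gain is to assume dynamic power scales roughly as f^3 once supply voltage must track frequency (an assumption of this sketch; the slide does not give a scaling model):

```python
# Hedged sketch: if P ~ C * V^2 * f and V scales ~linearly with f,
# then P ~ f^3, so a 25% power-efficiency gain at constant power buys
# roughly 1.25**(1/3) - 1 in frequency -- in line with the quoted
# 8-10% (the exact number depends on the design's real V/f curve).
efficiency_gain = 1.25
freq_gain = efficiency_gain ** (1 / 3) - 1
print(f"~{freq_gain:.1%} frequency at constant power")  # ~7.7%
```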

Chip Power Methodology
[Flow diagram: for each design partition ("macro"), power analysis under specified input patterns, constraints, and conditions produces a macro power model characterizing power vs. input switching factor, power vs. clock-gating percentage, and leakage power; high-level logic simulation supplies macro-level switching factors and clock-gating statistics, which the chip power analysis combines with the macro power models]
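A minimal sketch of what such a macro power model might look like (hypothetical structure and names; the slide only says power is characterized against switching factor, clock gating, and leakage):

```python
# Hypothetical macro power model (the linear form and names are
# assumptions for illustration, not the actual IBM model).
def macro_power(p_leak_mw: float, p_clock_mw: float, p_switch_mw: float,
                switching_factor: float, clock_gate_pct: float) -> float:
    """Power in mW for one macro under given activity conditions."""
    clocking = p_clock_mw * (1.0 - clock_gate_pct)   # ungated clock power
    switching = p_switch_mw * switching_factor       # data-dependent power
    return p_leak_mw + clocking + switching

# Chip power = sum over macros, with activity factors taken from
# high-level logic simulation.
chip_power = sum(macro_power(2.0, 10.0, 15.0, sf, cg)
                 for sf, cg in [(0.15, 0.60), (0.05, 0.90)])
print(f"{chip_power:.1f} mW")  # 12.0 mW for these two toy macros
```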

Chip Power Breakdown
[Chart: breakdown of total chip power by component; data not legible in the transcript]

DC Leakage Breakdown by Device Type
[Chart] DC leakage = 30% of total power

Power Supply Noise Considerations
- Significant work on power efficiency and clock gating increases the gap between minimum- and maximum-power states
  - Sudden switching-current transients, but multi-cycle response time through the package
- A good amount of on-chip decoupling capacitance is needed
  - > 10 µF of capacitance added on the base power supply
  - DRAM trench used for dense decoupling capacitors
- Decoupling also needed on the array supplies (SRAM, DRAM)
  - Early hardware: noise on the DRAM supply from bursts of high refresh/access activity

DRAM Supply Voltage Noise
- Early hardware, 265nF of capacitance on the array supplies: ~313mV peak-to-peak
- Later hardware, 1.74µF on the array supplies: ~136mV peak-to-peak
(Compared against ideal 1/C scaling in the sketch below.)
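The measured improvement tracks the added capacitance only roughly: ideal dV ∝ 1/C scaling would predict a larger reduction than was observed. A sketch of that comparison, using only the numbers above:

```python
# Measured noise reduction vs. ideal 1/C scaling.
c_early, v_early = 265e-9, 0.313   # 265 nF, ~313 mV peak-to-peak
c_later, v_later = 1.74e-6, 0.136  # 1.74 uF, ~136 mV peak-to-peak

v_ideal = v_early * (c_early / c_later)      # ~48 mV if dV ~ 1/C
print(f"ideal 1/C prediction: {v_ideal*1e3:.0f} mV, measured: {v_later*1e3:.0f} mV")
# Package inductance, regulation, and activity patterns keep the
# real benefit below the ideal capacitive scaling.
```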

Hardware Frequency Tuning
- Local (macro-level) controls with fine granularity
  - Local clock pulse width (narrow, nom, wide)
  - Local clock pulse timing (nom, late)
  - Master-clock falling-edge delay (MSFF designs)
  - Local clock-gating override
- Arrays and register files: pulse width & various timing settings
- Global controls: clock duty-cycle adjustments
- Other controls for debug and critical-path isolation
- Control settings optimized for maximum overall fmax (a toy sweep is sketched below)
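A toy sketch of the kind of settings sweep this implies (hypothetical: the setting names are taken from the list above, but measure_fmax() and its values are stand-ins for a real hardware measurement):

```python
# Hypothetical tuning sweep over local clock controls: choose the
# setting combination with the highest measured fmax.
from itertools import product

PULSE_WIDTHS  = ["narrow", "nom", "wide"]   # local clock pulse width
PULSE_TIMINGS = ["nom", "late"]             # local clock pulse timing

def measure_fmax(width: str, timing: str) -> float:
    """Toy stand-in: pretend 'wide'+'late' times best on this part."""
    table = {("narrow", "nom"): 5.15, ("narrow", "late"): 5.18,
             ("nom", "nom"): 5.20,    ("nom", "late"): 5.22,
             ("wide", "nom"): 5.24,   ("wide", "late"): 5.26}
    return table[(width, timing)]

best = max(product(PULSE_WIDTHS, PULSE_TIMINGS), key=lambda s: measure_fmax(*s))
print(best, measure_fmax(*best), "GHz")   # ('wide', 'late') 5.26 GHz
```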

Chip Vmin vs. Process Speed
[Plot: minimum operating voltage at a fixed 5.4 GHz frequency (y-axis, 1.1-1.3V) vs. process speed (x-axis, 0.9-1.1, arbitrary units); Vmin with default settings lies above the curve for optimized control settings (data average) across the process-speed range]

z196 Design: Conclusion
- The design team was able to maintain the high-frequency pipeline of z10 while adding out-of-order execution
- Large on-chip eDRAM L3 for additional performance
- Deep-trench decoupling capacitors provide additional frequency leverage
- Technology features + design for power efficiency => 18% frequency boost compared to the previous generation
- Net: 5.2 GHz final product frequency, and up to 40% performance improvement (single-thread legacy-workload metric)

Acknowledgements
The authors would like to acknowledge the many contributions from the rest of the System z team, the IBM EDA team, and the IBM Technology Development and Manufacturing teams.