Hybrid Memory Cube (HMC)


J. Thomas Pawlowski, Fellow
Chief Technologist, Architecture Development Group, Micron
jpawlowski@micron.com
August 4, 2011

© 2011 Micron Technology, Inc. All rights reserved. Products are warranted only to meet Micron's production data sheet specifications. Information, products, and/or specifications are subject to change without notice. All information is provided on an "AS IS" basis without warranties of any kind. Dates are estimates only. Drawings are not to scale. Micron and the Micron logo are trademarks of Micron Technology, Inc. All other trademarks are the property of their respective owners.

Outline

- Problems
- Goals
- HMC introduction
- New relationship between CPU and memory
- Internal architecture
- Performance
- Summary

The Problems

Observed problems:
- Latency (the classic "memory wall")
- Bandwidth-related issues
- Power and energy
- Many-core, multi-threaded CPUs generating higher random request rates
- Memory capacity per unit footprint

Future problems:
- Scaling bandwidth, density, and request rates while lowering latency

Essential underlying issue: direct control of memory must give way to memory abstraction, to
- mitigate negative characteristics of next-generation DRAM processes, and
- ease the introduction of future DRAM-replacement technologies.

Historic Energy x Bandwidth Improvement Per DRAM Generation

[Chart: relative energy x bandwidth improvement for DDR, DDR2, DDR3, and DDR4, on a scale of 0 to 8; the average improvement is 4.5x per generation.]

Hybrid Memory Cube Goals

1. Higher bandwidth
2. Higher signaling rate
3. Lower energy per useful unit of work done
4. Lower system latency
5. Increased request rate for many-core CPUs: CONCURRENCY!
6. Higher memory packing density
7. Abstracted interface, to:
   - lighten CPU/DRAM interaction
   - enable new DRAM management concepts
   - manage future process scaling and future technology introductions
8. Scalability to higher future bandwidths and density per footprint
9. Reduced customer and Micron time to market

Hybrid Memory Cube (HMC)

[Diagram: DRAM die stacked on a logic die and connected by through-silicon vias (TSVs); many internal buses carry more than 1 Tb/s between the stack and the logic die, which talks to the processor over high-speed links using an abstraction protocol.]

Notes: Tb/s = terabits per second. HMC height is exaggerated.

HMC Near Memory: MCM Configuration

- All links run between the host CPU and the HMC logic layer
- Maximum bandwidth per GB of capacity

[Diagram: CPU and HMC side by side on a multi-chip module; TSVs provide a wide data path through the DRAM stack, and a high-speed link connects the logic chip to the CPU.]

Notes: MCM = multi-chip module. Illustrative purposes only; height is exaggerated.

HMC Far Memory

- Some HMC links connect to the host, others to other cubes
- Serial links form networks of cubes: the memory is the network
- Scalable to meet system requirements
- Can be in module form or soldered down
- Can form a variety of topologies, e.g., tree, ring, double-ring, mesh (a hop-count sketch follows this list)
- Future interfaces: higher-speed electrical (SerDes), optical, or whatever interface is most appropriate for the job
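
To make the topology trade-off concrete, here is a minimal sketch, with illustrative cube counts of my own choosing (the slides name the topologies but give no numbers), comparing the average number of serial-link hops a request traverses in a daisy chain versus a ring of cubes:

```python
# Minimal sketch (illustrative parameters, not slide data): average hop
# count from the host to each cube for two far-memory topologies. Hop
# count is a rough proxy for the added serial-link latency per request.

def avg_hops_chain(n_cubes: int) -> float:
    # Cubes daisy-chained off one host link: cube i is i+1 hops away.
    return sum(range(1, n_cubes + 1)) / n_cubes

def avg_hops_ring(n_cubes: int) -> float:
    # Host attaches to one cube of a ring; traffic can travel either
    # direction, so cube i is min(i, n - i) hops around the ring,
    # plus the one host link.
    return sum(1 + min(i, n_cubes - i) for i in range(n_cubes)) / n_cubes

for n in (4, 8, 16):
    print(f"{n:2d} cubes: chain {avg_hops_chain(n):.2f} hops,"
          f" ring {avg_hops_ring(n):.2f} hops")
```

A ring roughly halves the average depth of a chain, which is one reason the slides call out ring and double-ring topologies for larger far-memory pools.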

Processor-Memory Interaction

Yesterday: multi-core CPU directly connected to DRAM-specific buses
- Complex scheduler, deep queues, heavy reordering (especially of writes)
- Many DRAM timing parameters standardized across vendors
- Worst-case everything
- The result: conservative, evolutionary, uncreative, slow performance growth

Now: direct connection to the HMC logic chip via an abstracted high-speed interface
- No need for a complex scheduler: just a thin arbiter and shallow queues
- Only the high-speed interface, protocol, and form factor need be standardized
- NO TIMING constraints; overrun is prevented
- Great innovations can occur under the hood
- Maximizes performance growth
- Logic-layer flexibility allows HMC cubes to be designed for multiple platforms and applications without changing the high-volume DRAM
- HMC takes requests and delivers results in the most advantageous order (see the sketch below)
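
To illustrate what "abstracted interface" buys the host, here is a minimal sketch of a tagged request/response protocol in which results may come back in any order. The field names, sizes, and class structure are hypothetical illustrations of the idea, not the HMC packet format:

```python
# Minimal sketch (assumed details, not Micron's protocol definition): the
# host tags each request and the cube may return responses in any order.
from dataclasses import dataclass

@dataclass
class Request:
    tag: int            # host-chosen ID used to match the response
    op: str             # "read" or "write"
    addr: int           # cube-internal address; layout is hidden from the host
    data: bytes = b""   # payload for writes

@dataclass
class Response:
    tag: int            # echoes the request tag
    data: bytes         # read payload (empty for write completions)

class AbstractedHost:
    """Thin host-side logic: issue requests, match responses by tag.
    No DRAM timing parameters (tRCD, tRP, refresh) appear here; the
    cube's logic die manages those internally."""
    def __init__(self) -> None:
        self.outstanding = {}   # tag -> op, for completion matching
        self.next_tag = 0

    def issue(self, op: str, addr: int, data: bytes = b"") -> Request:
        tag, self.next_tag = self.next_tag, self.next_tag + 1
        self.outstanding[tag] = op
        return Request(tag, op, addr, data)

    def complete(self, resp: Response) -> None:
        # Responses may arrive in any order; the tag restores the pairing.
        op = self.outstanding.pop(resp.tag)
        print(f"completed {op} (tag {resp.tag}), {len(resp.data)} bytes")
```

The point of the sketch is what is absent: the host carries no bank state machine, no timing tables, and no vendor-specific scheduling, only a tag table.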

HMC Architecture

Start with a clean slate: a conventional DRAM.

HMC Architecture

Re-partition the DRAM and strip away the common logic.

HMC Architecture

Stack multiple DRAMs.

HMC Architecture

Re-insert the common logic onto the logic base die, using 3DI and TSV technology.

[Diagram: eight DRAM die (DRAM0 through DRAM7) stacked above the logic base, with a vertical slice cut through the whole stack.]

- Vertical slices are managed to maximize overall device availability
- Optimized management of energy and refresh
- Self-test, error detection, correction, and repair in the logic base layer

HMC Architecture

Add sophisticated switching and optimized memory control, and now we have a whole new set of capabilities.

[Diagram: the logic base holds link interface controllers for the processor links, a crossbar switch, and a memory controller per vertical slice of the DRAM stack.]

- Wide, high-speed local bus for data movement
- Advanced memory controller functions
- DRAM control at the memory rather than at a distant host controller
- Reduced memory controller complexity and increased efficiency

Vastly More Responders

Conventional DRAM DIMM example: 8 devices, 8 banks per device
- The banks of all devices run in lock-step
- One of only 8 potential responders answers a typical request

HMC gen 1 example: 4 DRAMs x 16 slices x 2 banks = 128 responders (arithmetic worked below)
- One of 128 potential responders answers a typical request
- Double that to 256 if 8 DRAMs are in the stack
- A vast improvement in response to a random request stream

Beyond excellent concurrency, HMC has a significant impact on system latency:
- DRAM tRC is lower by design
- Lower queue delays and higher bank availability further shorten latency
- Serial links slightly increase latency
- The net effect is a substantial system latency reduction
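
A worked version of the responder arithmetic quoted above:

```python
# Responder counts from this slide.

# Conventional DIMM: 8 devices x 8 banks, but all devices run in
# lock-step, so a request targets one of only 8 independent responders.
dimm_responders = 8

# HMC gen 1: banks across the stacked die respond independently.
dram_die, slices, banks = 4, 16, 2
hmc_responders = dram_die * slices * banks
print(hmc_responders)        # 128
print(2 * hmc_responders)    # 256 with an 8-die stack
```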

Available Total Bandwidth

Per link:
- 16 transmit lanes plus 16 receive lanes (all differential)
- 10 Gbps per lane
- 32 lanes = 4 bytes wide, so 32 x 10 Gbps = 320 Gbps = 40 GBps per link

Per cube:
- 4 links per cube = 160 GBps
- 8 links per cube = 320 GBps
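
The same arithmetic, worked out (note the divide-by-eight converting gigabits to gigabytes):

```python
# Link bandwidth arithmetic from this slide.
lanes = 16 + 16            # transmit + receive lanes = 4 bytes wide
gbps_per_lane = 10

gBps_per_link = lanes * gbps_per_lane / 8   # 320 Gbps -> 40 GBps
print(gBps_per_link)        # 40.0 GBps per link
print(4 * gBps_per_link)    # 160 GBps with 4 links per cube
print(8 * gBps_per_link)    # 320 GBps with 8 links per cube
```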

RAS Features

Reliability, Availability, and Serviceability (RAS) features are built into HMC.

Systemic RAS features:
- Array repair
- DRAM/logic I/O interface repair
- Internal ECC is used for detection and short-term correction
- Upon detection, a repair is scheduled so that ECC is not a perpetual crutch for an identified issue and remains free to cover future errors
- The link I/O has means to detect communication errors
- A worst-case hard failure shuts down only one of many links

Internal DRAM Efficiency: 4- and 8-High Stacks

[Charts: memory bandwidth utilization, as a percentage of the 160 GB/s peak, versus the read share of overall traffic (50%, 53%, 55%, 57%, 67%, 75%, 100%, i.e., read:write ratios from 1:1 through 2:1 and 3:1 to read-only), plotted for 32-, 64-, and 128-byte transactions on 4-high and 8-high stacks.]

Effective Read/Write External Link Efficiency

[Chart: effective external link efficiency versus read/write traffic mix.]
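
Since the efficiency data itself lives in the chart, here is a minimal model of why larger transactions use the links more efficiently. The fixed 16-byte per-packet overhead is an assumption for illustration, not the HMC protocol specification:

```python
# Minimal sketch (assumed numbers, not slide data): link efficiency
# modeled as payload divided by payload plus a fixed per-packet overhead.
OVERHEAD_BYTES = 16   # assumed header + CRC cost per packet

def link_efficiency(payload_bytes: int) -> float:
    return payload_bytes / (payload_bytes + OVERHEAD_BYTES)

for payload in (32, 64, 128):
    print(f"{payload:3d}B payload: {link_efficiency(payload):.0%}")
# Larger transactions amortize the per-packet overhead, which is why the
# efficiency charts favor 128-byte over 32-byte transfers.
```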

Other Unique System Capabilities

Request rate capability:
- 3.2 billion operations per second with 32-byte data transactions
- Limited by the external data links: 2.3 Gops at 128 GBps, 2.9 Gops at 160 GBps (worked below)

Random request capability:
- Historically this gets worse as bandwidth increases: DDR3 SDRAM reaches ~29% of peak at BL8, DDR4 is worse, and GDDR5 is worse yet
- HMC achieves 75% of peak bandwidth with 64-byte data transactions

Network of HMCs:
- E.g., a double-ring or mesh of interconnected HMCs
- Average latency grows with the network, but it is a constant regardless of an individual HMC's depth in the mesh
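
A worked version of the request-rate arithmetic (requests per second = link bandwidth / bytes per transaction):

```python
# Request-rate arithmetic from this slide.
def gops(link_gBps: float, bytes_per_txn: int) -> float:
    # Billions of transactions per second the raw link bandwidth allows.
    return link_gBps / bytes_per_txn

print(gops(128, 32))   # 4.0 Gops of raw 32-byte capacity at 128 GBps
print(gops(160, 32))   # 5.0 Gops at 160 GBps
# The slide's 2.3 and 2.9 Gops figures sit below these raw ceilings
# because per-request protocol overhead consumes part of each link.
```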

HMC Gen1: Technology Comparison

Generation 1 (4 + 1 memory configuration: four DRAM die plus one logic die)

Technology                VDD (V)  IDD (A)  BW (GB/s)  Power (W)  mW/GB/s  pJ/bit  Real pJ/bit
SDRAM PC133, 1GB module     3.3     1.50       1.06       4.96    4664.97  583.12      762
DDR-333, 1GB module         2.5     2.19       2.66       5.48    2057.06  257.13      245
DDR2-667, 2GB module        1.8     2.88       5.34       5.18     971.51  121.44      139
DDR3-1333, 2GB module       1.5     3.68      10.66       5.52     517.63   64.70       52
DDR4-2667, 4GB module       1.2     5.50      21.34       6.60     309.34   38.67       39
HMC, 4 DRAM with logic      1.2     9.23     128.00      11.08      86.53   10.82       13.7

HMC Gen 1 details:
- 1Gb 50nm DRAM array, 90nm prototype logic
- 512MB total DRAM per cube
- 128 GB/s bandwidth
- 27mm x 27mm prototype
- Functional demonstrations!
- Reduced host CPU energy

Notes: the pJ/bit column is a simple calculation from IDD7 (IDD4 for SDRAM); the real pJ/bit column is measured on real systems, some with lower-density modules.
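
The derived columns follow from power = VDD x IDD, mW/GB/s = power / bandwidth, and pJ/bit = (mW/GB/s) / 8, since one gigabyte per second is eight gigabits per second. A quick check that reproduces the table to within rounding:

```python
# Recompute the comparison table's derived columns from VDD, IDD, and BW.
rows = [
    # technology,           VDD (V), IDD (A), BW (GB/s)
    ("SDRAM PC133 1GB",     3.3, 1.50,   1.06),
    ("DDR-333 1GB",         2.5, 2.19,   2.66),
    ("DDR2-667 2GB",        1.8, 2.88,   5.34),
    ("DDR3-1333 2GB",       1.5, 3.68,  10.66),
    ("DDR4-2667 4GB",       1.2, 5.50,  21.34),
    ("HMC 4 DRAM + logic",  1.2, 9.23, 128.00),
]
for name, vdd, idd, bw in rows:
    power_w = vdd * idd                 # module power
    mw_per_gbs = power_w * 1000 / bw    # power per unit of bandwidth
    pj_per_bit = mw_per_gbs / 8         # 1 GB/s = 8 Gb/s
    print(f"{name:20s} {power_w:6.2f} W  {mw_per_gbs:8.2f} mW/GB/s"
          f"  {pj_per_bit:7.2f} pJ/bit")
```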

HMC Demonstration Platform

[Image-only slide.]

Summary

Revolutionary DRAM performance improvement, demonstrated by:
- Changing to abstracted high-speed buses
- Employing 3D packaging with a hybrid of DRAM and logic technologies
- Pulling in and improving the DRAM controller
- Marrying DRAM and logic together with many TSVs
- Completely rethinking the DRAM architecture to exploit 3D
- Managing component health for robust system solutions

Results:
- More than 10x the bandwidth, less than 1/3 the energy, and lower latency
- Request rates far beyond 2 billion operations per second
- Logic-layer flexibility allows HMC to be tailor-made for multiple platforms and applications
- Scalable to ANY performance level. Imagine the possibilities.

Energy x Bandwidth Improvement Per DRAM Generation

[Chart: the earlier per-generation improvement chart, rescaled to 0-35 and extended with HMC, which far exceeds the DDR through DDR4 trend. Inset diagram: processor connected through high-speed links to the HMC logic layer.]