Scalable High Performance Main Memory System Using PCM Technology

Similar documents
Scalable Many-Core Memory Systems Lecture 3, Topic 2: Emerging Technologies and Hybrid Memories

Onyx: A Prototype Phase-Change Memory Storage Array

Phase Change Memory An Architecture and Systems Perspective

Energy-Aware Writes to Non-Volatile Main Memory

Phase Change Memory: Replacement or Transformational

The Role of Storage Class Memory in Future Hardware Platforms Challenges and Opportunities

Exploring the Potential of Phase Change Memories as an Alternative to DRAM Technology

Phase-change RAM (PRAM)- based Main Memory

Phase Change Memory An Architecture and Systems Perspective

Will Phase Change Memory (PCM) Replace DRAM or NAND Flash?

Lecture 8: Virtual Memory. Today: DRAM innovations, virtual memory (Sections )

Row Buffer Locality Aware Caching Policies for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu

Storage Systems : Disks and SSDs. Manu Awasthi July 6 th 2018 Computer Architecture Summer School 2018

Mohsen Imani. University of California San Diego. System Energy Efficiency Lab seelab.ucsd.edu

Operating System Supports for SCM as Main Memory Systems (Focusing on ibuddy)

WALL: A Writeback-Aware LLC Management for PCM-based Main Memory Systems

BIBIM: A Prototype Multi-Partition Aware Heterogeneous New Memory

A Memory Management Scheme for Hybrid Memory Architecture in Mission Critical Computers

Migration Based Page Caching Algorithm for a Hybrid Main Memory of DRAM and PRAM

Silent Shredder: Zero-Cost Shredding For Secure Non-Volatile Main Memory Controllers

A Hybrid Solid-State Storage Architecture for the Performance, Energy Consumption, and Lifetime Improvement

A Row Buffer Locality-Aware Caching Policy for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu

Evaluating Phase Change Memory for Enterprise Storage Systems

Memory technology and optimizations ( 2.3) Main Memory

A Page-Based Storage Framework for Phase Change Memory

Preface. Fig. 1 Solid-State-Drive block diagram

Emerging NVM Memory Technologies

Lecture: Memory, Multiprocessors. Topics: wrap-up of memory systems, intro to multiprocessors and multi-threaded programming models

A Prototype Storage Subsystem based on PCM

Lecture 7: PCM, Cache coherence. Topics: handling PCM errors and writes, cache coherence intro

Virtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010

Storage Systems : Disks and SSDs. Manu Awasthi CASS 2018

PreSET: Improving Performance of Phase Change Memories by Exploiting Asymmetry in Write Times

Aerie: Flexible File-System Interfaces to Storage-Class Memory [Eurosys 2014] Operating System Design Yongju Song

Hybrid Memory Platform

Flash Trends: Challenges and Future

Flash-Conscious Cache Population for Enterprise Database Workloads

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

I/O CANNOT BE IGNORED

Advanced Computer Architecture

SNIA Tutorial 1 A CASE FOR FLASH STORAGE HOW TO CHOOSE FLASH STORAGE FOR YOUR APPLICATIONS

Memory Hierarchy. Slides contents from:

Helmet : A Resistance Drift Resilient Architecture for Multi-level Cell Phase Change Memory System

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface

Phase Change Memory and its positive influence on Flash Algorithms Rajagopal Vaideeswaran Principal Software Engineer Symantec

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories

AdaMS: Adaptive MLC/SLC Phase-Change Memory Design for File Storage

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1

Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative

I/O, Disks, and RAID Yi Shi Fall Xi an Jiaotong University

I/O Commercial Workloads. Scalable Disk Arrays. Scalable ICDA Performance. Outline of This Talk: Related Work on Disk Arrays.

Test and Reliability of Emerging Non-Volatile Memories

Toward a Memory-centric Architecture

Is Buffer Cache Still Effective for High Speed PCM (Phase Change Memory) Storage?

Intel s s Memory Strategy for the Wireless Phone

Advanced Memory Organizations

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

Chapter 12 Wear Leveling for PCM Using Hot Data Identification

COMPUTER ORGANIZATION AND DESIGN

18-740: Computer Architecture Recitation 4: Rethinking Memory System Design. Prof. Onur Mutlu Carnegie Mellon University Fall 2015 September 22, 2015

Middleware and Flash Translation Layer Co-Design for the Performance Boost of Solid-State Drives

Lecture-14 (Memory Hierarchy) CS422-Spring

Moneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories

Compiler-Assisted Refresh Minimization for Volatile STT-RAM Cache

Designing a Fast and Adaptive Error Correction Scheme for Increasing the Lifetime of Phase Change Memories

Memory Hierarchy. Slides contents from:

Architectural Aspects in Design and Analysis of SOTbased

CS356: Discussion #9 Memory Hierarchy and Caches. Marco Paolieri Illustrations from CS:APP3e textbook

Database Systems. November 2, 2011 Lecture #7. topobo (mit)

Presented by: Nafiseh Mahmoudi Spring 2017

V. Primary & Secondary Memory!

CSCI-UA.0201 Computer Systems Organization Memory Hierarchy

Registers. Instruction Memory A L U. Data Memory C O N T R O L M U X A D D A D D. Sh L 2 M U X. Sign Ext M U X ALU CTL INSTRUCTION FETCH

Caches. Hiding Memory Access Times

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

Storage and Memory Infrastructure to Support 5G Applications. Tom Coughlin President, Coughlin Associates

Course Administration

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

Emerging NV Storage and Memory Technologies --Development, Manufacturing and

SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS

Performance Benefits of Running RocksDB on Samsung NVMe SSDs

EECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141

Morphable Memory System: A Robust Architecture for Exploiting Multi-Level Phase Change Memories

15-740/ Computer Architecture Lecture 5: Project Example. Jus%n Meza Yoongu Kim Fall 2011, 9/21/2011

Chapter 13: I/O Systems

Architecture Exploration of High-Performance PCs with a Solid-State Disk

Computer Architecture 计算机体系结构. Lecture 6. Data Storage and I/O 第六讲 数据存储和输入输出. Chao Li, PhD. 李超博士

Performance Modeling and Analysis of Flash based Storage Devices

The Google File System

Memory Hierarchy Y. K. Malaiya

NVMFS: A New File System Designed Specifically to Take Advantage of Nonvolatile Memory

Contents. Memory System Overview Cache Memory. Internal Memory. Virtual Memory. Memory Hierarchy. Registers In CPU Internal or Main memory

Accelerating Microsoft SQL Server Performance With NVDIMM-N on Dell EMC PowerEdge R740

Enabling Cost-effective Data Processing with Smart SSD

Lecture: Memory, Coherence Protocols. Topics: wrap-up of memory systems, intro to multi-thread programming models

Using Transparent Compression to Improve SSD-based I/O Caches

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Transcription:

Scalable High Performance Main Memory System Using PCM Technology Moinuddin K. Qureshi Viji Srinivasan and Jude Rivers IBM T. J. Watson Research Center, Yorktown Heights, NY International Symposium on Computer Architecture (ISCA-2009) Jul-28-09

Main Memory Capacity Wall More cores in system More concurrency Larger working set Demand for main memory capacity continues to increase Main Memory System consisting of DRAM are hitting: 1. Cost wall: Major % of cost of large servers is main memory 2. Scaling wall: DRAM scaling to small technology is challenge 3. Power wall: IBM P670 Server Processor Memory Small (4 proc, 16GB) 384 Watts 314 Watts Large (16 proc, 128GB) 840 Watts 1223 Watts Source: Lefurgy et al. IEEE Computer 2003 Need a practical solution to increase main-memory capacity 2

The Technology Hierarchy More capacity by cheaper, denser, (slower) technology Memory System High-Performance Disk L1(SRAM) EDRAM DRAM PCM Flash HDD 2 1 2 3 2 5 2 7 2 9 2 11 2 13 2 15 2 17 2 19 2 21 2 23 Typical access latency in processor cycles (@ 4 GHz) Phase Change Memory (PCM) promising candidate for large capacity main memory 3

Outline Introduction What is PCM? Hybrid Memory System Evaluation Lifetime Analysis Summary 4

What is Phase Change Memory? Phase change material (chalcogenide glass) exists in two states: 1. Amorphous: high resistivity 2. Crystalline: low resistivity Materials can be switched between states reliably, quickly, large number of times Bit Line PCM stores data in terms of resistance Low resistance (SET state) = 1 High resistance (RESET state) = 0 N Word Line N Word Line I N 5

How does PCM work? Switching by heating using electrical pulses SET: sustained current to heat cell above Tcryst RESET: cell heated above Tmelt and quenched Temperature RESET SET T melt T cryst Large Current Small Current Time [ns] Memory Element SET Low resistance 10 3-10 4 Ω Access Device RESET High resistance 10 6-10 7 Ω 6 Photo Courtesy: Bipin Rajendran, IBM

Key Characteristics of PCM + Scales better than DRAM, small cell size Prototypes as small as 3nm x 20 nm fabricated and tested [Raoux+ IBMJRD 08] + Can store multiple bits/cell More density in the same area Prototypes with 2 bits/cell in ISSCC 08. >2 bits/cell expected soon. + Non-Volatile Memory Technology Data retention of 10 years Power implications, system implications Challenges: - More latency compared to DRAM. - Limited Endurance (~10 million writes per cell) - Write bandwidth constrained, so better to write less often. 7

Outline Introduction What is PCM? Hybrid Memory System Evaluation Lifetime Analysis Summary 8

Hybrid Memory System PCM Main Memory DATA W Processor DRAM Buffer T DATA Flash Or HDD T=Tag-Store PCM Write Queue Hybrid Memory System: 1. DRAM as cache to tolerate PCM Rd/Wr latency and Wr bandwidth 2. PCM as main-memory to provide large capacity at good cost/power 9

Lazy Write Architecture Problem: Double PCM writes to dirty pages on install DRAM Buffer PCM Processor Flash/Disk WRQ For example: Daxpy Kernel: Y[i] = Y[i] + X[i] Baseline has 2 writes for Y[i] and 1 for X[i] Lazy write has 1 write for Y[i] and 1 for X[i] 10

Line Level Write Back Problem: Not all lines in a dirty page are dirty Solution: Dirty bits per line in DRAM buffer and write-back only dirty lines from DRAM to PCM Num Writes to Each Line (Mln) Line_id db1 Average Average 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 db2 Problem: With LLWB, not all lines in dirty pages are written uniformly 11

Fine Grained Wear Leveling Solution: Fine Grained Wear Leveling (FGWL) -When a page gets allocated page is rotated by a random shift value -The rotate value remains constant while page remains in memory -On replacement of a page, a new random value is assigned for a new page -Over time, the write traffic per line becomes uniform. Num Writes to Each Line (Mln) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 db2 Line_id db1 Average Average FGWL makes writes across lines in a dirty page uniform 12

Outline Introduction What is PCM? Hybrid Memory System Evaluation Lifetime Analysis Summary 13

Evaluation Framework Trace Driven Simulator: 16-core system (simple core), 8GB DRAM main-memory at 320 cycles HDD (2 ms) with Flash (32 us) with Flash hit-rate of 99% Workloads: Database workloads & Data parallel kernels 1. Database workloads: db1 and db2 2. Unix utilities: qsort and binary search 3. Data Mining : K-means and Gauss Seidal 4. Streaming: DAXPY and Vector Dot Product Assumption: PCM 4X denser & 4X slower than DRAM 32GB @ 1280 cycle read latency 14

Reduction in Page Faults Benefit from capacity Need >16GB Streaming 15

Impact on Execution Time PCM with DRAM buffer performs similar to equal capacity DRAM storage 16

Impact of PCM Latency DRAM-8GB PCM-32GB HYBRID (1+32)GB DRAM-32GB Hybrid memory system is relatively insensitive to PCM Latency 17

Power Evaluations Significant Power and Energy savings with PCM based hybrid memory system 18

Outline Introduction What is PCM? Hybrid Memory System Evaluation Lifetime Analysis Summary 19

Impact of Write Endurance B Bytes/Cycle written to PCM S PCM capacity in bytes Wmax Max writes per PCM cell Assuming uniform writes to PCM Endurance (in cycles) = (S/B).Wmax F Frequency of System (4GHz) Y = Number of years (lifetime) There are 2 25 seconds in a year Num. cycles in Y years = Y. F.2 25 Y = (S/B). Wmax F.2 25 For a 4GHz System, a 32GB PCM written at 1 Byte per Cycle Y = Wmax 4 million If Wmax = 10 million, PCM will last for 2.5 years 20

Lifetime Results Table shows average bytes per cycle written to PCM and Average lifetime of PCM assuming Wmax = 10 million Configuration Avg. Bytes/Cycle Avg. Lifetime 1GB DRAM + 32GB PCM 0.807 3.0 yrs + Lazy Write 0.725 3.4 yrs + Line Level Write Back 0.316 7.6 yrs + Bypass Streaming Apps 0.247 9.7 yrs Proposed filtering techniques reduce write traffic to PCM by 3.2X, increasing its lifetime from 3 to 9.7 years 21

Outline Introduction What is PCM? Hybrid Memory System Evaluation Lifetime Analysis Summary 22

Summary Need more main memory capacity: DRAM hitting power, cost, scaling wall PCM is an emerging technology 4x denser than DRAM but with slower access time and limited write endurance We propose a Hybrid Memory System (DRAM+PCM) that provides significant power and performance benefits Proposed write filtering techniques reduce writes by 3x and increase PCM lifetime from 3 years to 9 years Not touched in this talk but important: Exploiting non-volatile memories for system enhancement & related OS issues. 23

Thanks! 24