Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories

Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories HanBin Yoon, Justin Meza, Naveen Muralimanohar*, Onur Mutlu, Norm Jouppi* Carnegie Mellon University * Hewlett-Packard Labs Google, Inc.

Executive Summary Phase-change memory (PCM) is a promising emerging technology More scalable than DRAM, faster than flash Multi-level cell (MLC) PCM stores multiple bits per cell for high density Problem: Higher latency/energy compared to non-MLC PCM Observation: MLC bits have asymmetric read/write characteristics Some bits can be read quickly but written slowly, and vice versa 2

Executive Summary Goal: Read data from fast-read bits; write data to fast-write bits Solution: Decouple bits to expose fast-read/write memory regions Map read/write-intensive data to appropriate memory regions Split device row buffers to leverage decoupling for better locality Result: Improved performance (+19.2%) and energy efficiency (+14.4%) Across SPEC CPU2006 and data-intensive/cloud workloads 3

Outline Background Problem and Goal Key Observations MLC-PCM cell read asymmetry MLC-PCM cell write asymmetry Our Techniques Decoupled Bit Mapping (DBM) Asymmetric Page Mapping (APM) Split Row Buffering (SRB) Results Conclusions 4

Background: PCM Emerging high-density memory technology Potential for a scalable DRAM alternative Projected to be 3 to 12x denser than DRAM Access latency within an order of magnitude of DRAM Stores data in the form of the resistance of the cell material 5

PCM Resistance Value [Resistance axis: cell value 1 at low resistance, 0 at high resistance] 6

Background: MLC-PCM Multi-level cell: more than 1 bit per cell Further increases density by 2 to 4x [Lee+, ISCA'09] But MLC-PCM also has drawbacks Higher latency and energy than single-level cell PCM Let's take a look at why this is the case 7

MLC-PCM Resistance Value [Resistance axis: cell values 11, 10, 01, 00 (Bit 1, Bit 0) in order of increasing resistance] 8

MLC-PCM Resistance Value Less margin between values means more precise sensing/modification of cell contents, and therefore higher latency/energy (~2x for reads and ~4x for writes) [Resistance axis: cell values 11, 10, 01, 00] 9
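
To make the narrower margins concrete, here is a minimal sketch in C of threshold-based decoding. The resistance thresholds are purely illustrative (not values from the paper); the point is that a 2-bit cell must be resolved against three thresholds instead of one, so each value window is narrower.

    #include <stdint.h>

    /* Illustrative resistance thresholds (arbitrary units), NOT device data. */
    #define SLC_THRESHOLD 1000
    static const int mlc_thresholds[3] = { 500, 1000, 1500 };

    /* SLC: value 1 = low resistance, 0 = high resistance (one threshold). */
    static inline int slc_decode(int resistance)
    {
        return resistance < SLC_THRESHOLD ? 1 : 0;
    }

    /* 2-bit MLC: values 11, 10, 01, 00 in order of increasing resistance
     * (three thresholds, narrower windows, more precise sensing needed). */
    static inline uint8_t mlc_decode(int resistance)
    {
        if (resistance < mlc_thresholds[0]) return 0x3; /* 11 */
        if (resistance < mlc_thresholds[1]) return 0x2; /* 10 */
        if (resistance < mlc_thresholds[2]) return 0x1; /* 01 */
        return 0x0;                                     /* 00 */
    }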

Problem and Goal Want to leverage MLC-PCM's strengths Higher density More scalability than existing technologies (DRAM) But we also want to mitigate MLC-PCM's weaknesses Higher latency/energy Our goal in this work is to design new hardware/software optimizations that mitigate the weaknesses of MLC-PCM 10

Outline Background Problem and Goal Key Observations MLC-PCM cell read asymmetry MLC-PCM cell write asymmetry Our Techniques Decoupled Bit Mapping (DBM) Asymmetric Page Mapping (APM) Split Row Buffering (SRB) Results Conclusions 11

Observation 1: Read Asymmetry The read latency/energy of Bit 1 is lower than that of Bit 0 This is due to how MLC-PCM cells are read 12

Observation 1: Read Asymmetry Simplified example: a capacitor charged to a reference voltage is connected to an MLC-PCM cell of unknown resistance 13

Observation 1: Read Asymmetry Simplified example: infer the data value from the capacitor's behavior 15

Observation 1: Read Asymmetry [Voltage-vs.-time sensing plot] The sense capacitor starts fully charged at the reference voltage. Once the PCM cell is connected, the capacitor drains through the cell, and the time at which it finishes draining reveals the stored value (01 in this example); the four values 11, 10, 01, 00 correspond to increasing resistance and therefore increasing drain time.

Observation 1: Read Asymmetry In existing devices Both MLC bits are read at the same time Must wait the maximum time to read both bits However, we can infer information about Bit 1 before this time 21

Observation 1: Read Asymmetry [Voltage-vs.-time sensing plot] Bit 1 only distinguishes {11, 10} from {01, 00}, so its value is determined once the middle threshold time has passed; this is the (shorter) time to determine Bit 1's value. Bit 0 must resolve all four levels, so the time to determine Bit 0's value is the full, maximum sensing time.
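
A minimal sketch in C of this staged-sensing idea. The timing constants and thresholds are made up for illustration, not the paper's device parameters; the point is that Bit 1 becomes available after a shorter sensing interval than Bit 0.

    /* Illustrative boundaries on the capacitor drain time (ns); made-up
     * numbers. Lower resistance drains faster: 11 < T1, 10 < T2, 01 < T3,
     * and 00 means the capacitor has not drained by T3. */
    enum { T1 = 40, T2 = 62, T3 = 125 };

    /* Bit 1 separates {11, 10} from {01, 00}, so after waiting T2 ns we know
     * its value: either the capacitor has already drained (Bit 1 = 1) or it
     * has not (Bit 1 = 0). Bit 0 needs the full T3 ns. */
    static int time_to_know_bit1(void) { return T2; }
    static int time_to_know_bit0(void) { return T3; }

    /* Decode both bits once the drain time has been fully observed. */
    static void mlc_decode_from_time(int drain_ns, int *bit1, int *bit0)
    {
        int value = (drain_ns < T1) ? 3 :     /* 11 */
                    (drain_ns < T2) ? 2 :     /* 10 */
                    (drain_ns < T3) ? 1 : 0;  /* 01 : 00 */
        *bit1 = (value >> 1) & 1;
        *bit0 = value & 1;
    }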

Observation 2: Write Asymmetry The write latency/energy of Bit 0 is lower than that of Bit 1 This is due to how PCM cells are written In PCM, cell resistance must physically be changed Requires applying different amounts of current For different amounts of time 26

Observation 2: Write Asymmetry Writing both bits in an MLC cell: 250ns Only writing Bit 0: 210ns Only writing Bit 1: 250ns Existing devices write both bits simultaneously (250ns) 27
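
A small sketch in C using the latencies quoted on the slide; the function and the no-change case are mine, for illustration only. It shows why confining a write to the fast-write (Bit 0) half of a cell, as the decoupled mapping below allows, saves latency.

    /* Write latencies from the slide (ns). */
    #define T_WRITE_BOTH 250
    #define T_WRITE_BIT0 210   /* fast-write bit */
    #define T_WRITE_BIT1 250   /* slow-write bit */

    /* Latency of updating a 2-bit cell from old_val to new_val. A coupled
     * device always pays T_WRITE_BOTH; with decoupled bits, a write that
     * only touches Bit 0 finishes in T_WRITE_BIT0. */
    static int mlc_write_latency(unsigned old_val, unsigned new_val)
    {
        unsigned changed = old_val ^ new_val;    /* which bits differ */
        if (changed == 0)   return 0;            /* nothing to write */
        if (changed == 0x1) return T_WRITE_BIT0; /* only Bit 0 changes */
        if (changed == 0x2) return T_WRITE_BIT1; /* only Bit 1 changes */
        return T_WRITE_BOTH;                     /* both bits change */
    }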

Key Observation Summary Bit 1 is faster to read than Bit 0 Bit 0 is faster to write than Bit 1 We refer to Bit 1 as the fast-read/slow-write (FR) bit We refer to Bit 0 as the slow-read/fast-write (FW) bit We leverage read/write asymmetry to enable several optimizations 28

Outline Background Problem and Goal Key Observations MLC-PCM cell read asymmetry MLC-PCM cell write asymmetry Our Techniques Decoupled Bit Mapping (DBM) Asymmetric Page Mapping (APM) Split Row Buffering (SRB) Results Conclusions 29

Technique 1: Decoupled Bit Mapping (DBM) Key Idea: Logically decouple FR bits from FW bits Expose FR bits as low-read-latency regions of memory Expose FW bits as low-write-latency regions of memory 30

Technique 1: Decoupled Bit Mapping (DBM) MLC-PCM cell Bit 1 (FR) Bit 0 (FW) 31

Technique 1: Decoupled Bit Mapping (DBM) MLC-PCM cell Bit 1 (FR) Bit 0 (FW) Coupled (baseline): Contiguous bits alternate between FR and FW 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32

Technique 1: Decoupled Bit Mapping (DBM) MLC-PCM cell Bit 1 (FR) Bit 0 (FW) Coupled (baseline): Contiguous bits alternate between FR and FW 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 33

Technique 1: Decoupled Bit Mapping (DBM) MLC-PCM cell Bit 1 (FR) Bit 0 (FW) Coupled (baseline): Contiguous bits alternate between FR and FW (FR bits: 1 3 5 7 9 11 13 15; FW bits: 0 2 4 6 8 10 12 14) Decoupled: Contiguous regions are all-FR or all-FW (FR bits: 8 9 10 11 12 13 14 15; FW bits: 0 1 2 3 4 5 6 7) 34
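
A minimal sketch in C of the two mappings for the 16-bit toy row in the slide (the structure and function names are mine). Coupled mapping interleaves logical bits across the FR and FW positions of each cell; DBM places the low half of the logical bits in FW positions and the high half in FR positions, creating contiguous fast-write and fast-read regions.

    #include <stdbool.h>

    #define ROW_BITS 16            /* toy row size from the slide */
    #define CELLS    (ROW_BITS / 2)

    struct bit_loc {
        int  cell;    /* which MLC cell holds this logical bit */
        bool is_fr;   /* true: Bit 1 (fast-read); false: Bit 0 (fast-write) */
    };

    /* Coupled (baseline): logical bits 2i and 2i+1 share cell i, so
     * contiguous data alternates between FW and FR bits. */
    static struct bit_loc map_coupled(int logical_bit)
    {
        struct bit_loc l = { logical_bit / 2, (logical_bit % 2) == 1 };
        return l;
    }

    /* Decoupled (DBM): logical bits 0..7 occupy FW bits and 8..15 occupy
     * FR bits, giving one contiguous FW region and one FR region. */
    static struct bit_loc map_decoupled(int logical_bit)
    {
        struct bit_loc l = { logical_bit % CELLS, logical_bit >= CELLS };
        return l;
    }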

Technique 1: Decoupled Bit Mapping (DBM) By decoupling, we've created regions with distinct characteristics We examine the use of 4KB regions (e.g., the OS page size) [Physical address space divided into fast-read pages and fast-write pages] We want to match frequently read data to FR pages and frequently written data to FW pages Toward this end, we propose a new OS page allocation scheme 35

Technique 2: Asymmetric Page Mapping (APM) Key Idea: predict page read/write intensity and map accordingly Measure write intensity of instructions that access data If an instruction has high write intensity and first touches a page, the OS allocates a FW page; otherwise, it allocates a FR page Implementation (full details in paper) Small hardware cache of instructions that often write data Updated by cache controller when data is written to memory New instruction for the OS to query the table for a prediction 36
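
A simplified sketch in C of the OS-side decision at first touch. The region names and the query function are hypothetical stand-ins for the paper's mechanism (a hardware table queried through a new instruction); it only illustrates the allocation policy.

    #include <stdint.h>
    #include <stdbool.h>

    enum region { REGION_FR, REGION_FW };

    /* Hypothetical query of the hardware PC table (in the real design the
     * OS would issue a new instruction); returns true if this instruction
     * is predicted to be write-intensive. A sketch of the table itself
     * appears with the backup slide at the end. */
    bool pc_is_write_intensive(uint64_t pc);

    /* Page-fault path: on the first touch of a page, pick the region the
     * physical page should come from. */
    enum region apm_choose_region(uint64_t first_touch_pc)
    {
        return pc_is_write_intensive(first_touch_pc) ? REGION_FW : REGION_FR;
    }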

Technique 3: Split Row Buffering (SRB) Row buffer stores the contents of the currently accessed row Used to buffer data when sending/receiving across I/O ports Key Idea: With DBM, buffer FR bits independently from FW bits Coupled (baseline): must use a large monolithic row buffer (8KB) DBM: can use two smaller associative row buffers (2x4KB) Can improve row buffer locality, reducing latency and energy Implementation (full details in paper) No additional SRAM buffer storage Requires multiplexer logic for selecting FR/FW buffers 37
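
A rough sketch in C of the split-buffer bookkeeping (buffer sizes are from the slide; the structure and names are mine). Because the FR and FW halves are tracked independently, a read-heavy row can stay open in the FR buffer while a different write-heavy row occupies the FW buffer.

    #include <stdint.h>
    #include <stdbool.h>

    #define HALF_BUF_BYTES (4 * 1024)   /* 2 x 4KB instead of one 8KB buffer */

    struct half_row_buffer {
        bool     valid;
        uint32_t open_row;              /* row currently held in this half */
        uint8_t  data[HALF_BUF_BYTES];
    };

    /* One buffer for the fast-read bits, one for the fast-write bits; they
     * may hold (halves of) two different rows at the same time. */
    static struct half_row_buffer fr_buf, fw_buf;

    static bool srb_hit(uint32_t row, bool is_fr_half)
    {
        struct half_row_buffer *b = is_fr_half ? &fr_buf : &fw_buf;
        return b->valid && b->open_row == row;
    }

    /* On a miss, only the relevant half-row is fetched from the PCM array. */
    static void srb_open(uint32_t row, bool is_fr_half)
    {
        struct half_row_buffer *b = is_fr_half ? &fr_buf : &fw_buf;
        b->valid = true;
        b->open_row = row;
        /* array access to fill b->data omitted */
    }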

Outline Background Problem and Goal Key Observations MLC-PCM cell read asymmetry MLC-PCM cell write asymmetry Our Techniques Decoupled Bit Mapping (DBM) Asymmetric Page Mapping (APM) Split Row Buffering (SRB) Results Conclusions 38

Evaluation Methodology Cycle-level x86 CPU-memory simulator CPU: 8 cores, 32KB private L1 / 512KB private L2 per core Shared L3: 16MB on-chip eDRAM Memory: MLC-PCM, dual-channel DDR3 1066 MT/s, 2 ranks Workloads: SPEC CPU2006, NAS Parallel Benchmarks, GraphLab Performance metrics: Multi-programmed (SPEC): weighted speedup Multi-threaded (NPB, GraphLab): execution time 39
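
For reference, weighted speedup is the standard multi-programmed metric: each program's IPC when sharing the system, normalized to its IPC when running alone, summed over all programs. A minimal sketch:

    /* Weighted speedup = sum_i (IPC_shared[i] / IPC_alone[i]). */
    static double weighted_speedup(const double *ipc_shared,
                                   const double *ipc_alone, int n)
    {
        double ws = 0.0;
        for (int i = 0; i < n; i++)
            ws += ipc_shared[i] / ipc_alone[i];
        return ws;
    }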

Comparison Points Conventional: coupled bits (slow read, slow write) All-FW: hypothetical all-FW memory (slow read, fast write) All-FR: hypothetical all-FR memory (fast read, slow write) DBM: decoupled bit mapping (50% FR pages, 50% FW pages) DBM+: techniques that leverage DBM (APM and SRB) Ideal: idealized cells with the best characteristics (fast read, fast write) 40

System Performance [Normalized speedup over Conventional: All fast write +10%, All fast read +16%, DBM +13%, DBM+APM+SRB +19%, Ideal +31%] 41

System Performance [same chart as above] All-FR > All-FW here, dependent on workload access patterns 42

System Performance [same chart as above] DBM allows systems to take advantage of reduced read latency (FR region) and reduced write latency (FW region) 43

Memory Energy Efficiency [Normalized performance per Watt vs. Conventional for All fast write, All fast read, DBM, DBM+APM+SRB, and Ideal; labeled improvements of +5%, +8%, +12%, +14%, and +30%, with DBM+APM+SRB at +14%] 44

Memory Energy Efficiency [same chart as above] Benefits come from lower read energy by exploiting read asymmetry (the dominant case) and from lower write energy by exploiting write asymmetry 45

Other Results in the Paper Improved thread fairness (less resource contention) From speeding up per-thread execution Techniques do not exacerbate PCM wearout problem ~6 year operational lifetime possible 46

Outline Background Problem and Goal Key Observations MLC-PCM cell read asymmetry MLC-PCM cell write asymmetry Our Techniques Decoupled Bit Mapping (DBM) Asymmetric Page Mapping (APM) Split Row Buffering (SRB) Results Conclusions 47

Conclusions Phase-change memory (PCM) is a promising emerging technology More scalable than DRAM, faster than flash Multi-level cell (MLC) PCM stores multiple bits per cell for high density Problem: Higher latency/energy compared to non-MLC PCM Observation: MLC bits have asymmetric read/write characteristics Some bits can be read quickly but written slowly, and vice versa 48

Conclusions Goal: Read data from fast-read bits; write data to fast-write bits Solution: Decouple bits to expose fast-read/write memory regions Map read/write-intensive data to appropriate memory regions Split device row buffers to leverage decoupling for better locality Result: Improved performance (+19.2%) and energy efficiency (+14.4%) Across SPEC CPU2006 and data-intensive/cloud workloads 49

Thank You! 50

Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories HanBin Yoon, Justin Meza, Naveen Muralimanohar*, Onur Mutlu, Norm Jouppi* Carnegie Mellon University * Hewlett-Packard Labs Google, Inc.

Backup Slides 52

PCM Cell Operation 53

Integrating ADC 54

APM Implementation [Diagram: program execution feeding the PC table through the cache]
Program execution (ProgCounter, Instruction): 0x00400f91 mov %r14d,%eax; 0x00400f94 movq $0xff..,0xb8(%r13); 0x00400f9f mov %edx,0xcc(%r13); 0x00400fa6 neg %eax; 0x00400fa8 lea 0x68(%r13),%rcx
Write accesses: the cache stores PC table indices per line (00 01 10 11 10)
PC table (PC, WBs): 0x0040100f 7279; 0x00400fbd 11305; 0x00400f94 5762; 0x00400fc1 4744
Counters are updated on writeback to memory. 55
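
A minimal sketch in C of the hardware-side bookkeeping suggested by this figure; the table size, replacement policy, and threshold are hypothetical. In the real design the cache stores a small PC-table index per line and the corresponding entry is updated when that line is written back; here the PC is passed directly for simplicity. It also provides the pc_is_write_intensive() query assumed by the APM sketch earlier.

    #include <stdint.h>
    #include <stdbool.h>

    #define PC_TABLE_ENTRIES        64     /* hypothetical table size */
    #define WRITE_INTENSE_THRESHOLD 1000   /* hypothetical cutoff (WBs) */

    struct pc_entry {
        uint64_t pc;          /* program counter of the writing instruction */
        uint32_t writebacks;  /* writebacks to memory attributed to this PC */
    };

    static struct pc_entry pc_table[PC_TABLE_ENTRIES];

    /* Called when a dirty line last written by 'pc' is written back to
     * memory (the writeback-to-memory arrow in the figure). */
    void pc_table_record_writeback(uint64_t pc)
    {
        struct pc_entry *e = &pc_table[pc % PC_TABLE_ENTRIES];
        if (e->pc != pc) {           /* naive direct-mapped replacement */
            e->pc = pc;
            e->writebacks = 0;
        }
        e->writebacks++;
    }

    /* Query used by apm_choose_region() in the APM sketch above. */
    bool pc_is_write_intensive(uint64_t pc)
    {
        struct pc_entry *e = &pc_table[pc % PC_TABLE_ENTRIES];
        return e->pc == pc && e->writebacks >= WRITE_INTENSE_THRESHOLD;
    }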