Couture: Tailoring STT-MRAM for Persistent Main Memory. Mustafa M Shihab Jie Zhang Shuwen Gao Joseph Callenes-Sloan Myoungsoo Jung

Similar documents
Couture: Tailoring STT-MRAM for Persistent Main Memory

Couture: Tailoring STT-MRAM for Persistent Main Memory

Area, Power, and Latency Considerations of STT-MRAM to Substitute for Main Memory

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University

Architectural Aspects in Design and Analysis of SOTbased

AC-DIMM: Associative Computing with STT-MRAM

ROSS: A Design of Read-Oriented STT-MRAM Storage for Energy-Efficient Non-Uniform Cache Architecture

Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative

Loadsa 1 : A Yield-Driven Top-Down Design Method for STT-RAM Array

Mohsen Imani. University of California San Diego. System Energy Efficiency Lab seelab.ucsd.edu

Cascaded Channel Model, Analysis, and Hybrid Decoding for Spin-Torque Transfer Magnetic Random Access Memory (STT-MRAM)

Energy-Efficient Spin-Transfer Torque RAM Cache Exploiting Additional All-Zero-Data Flags

Unleashing MRAM as Persistent Memory

Don t Forget the Memory: Automatic Block RAM Modelling, Optimization, and Architecture Exploration

OAP: An Obstruction-Aware Cache Management Policy for STT-RAM Last-Level Caches

Phase Change Memory An Architecture and Systems Perspective

Magnetoresistive RAM (MRAM) Jacob Lauzon, Ryan McLaughlin

DESIGNING LARGE HYBRID CACHE FOR FUTURE HPC SYSTEMS

Understanding Reduced-Voltage Operation in Modern DRAM Devices

Emerging NVM Memory Technologies

The Engine. SRAM & DRAM Endurance and Speed with STT MRAM. Les Crudele / Andrew J. Walker PhD. Santa Clara, CA August

Improving Energy Efficiency of Write-asymmetric Memories by Log Style Write

VLSID KOLKATA, INDIA January 4-8, 2016

Design-Induced Latency Variation in Modern DRAM Chips:

I/O CANNOT BE IGNORED

CIRCUIT AND ARCHITECTURE CO-DESIGN OF STT-RAM FOR HIGH PERFORMANCE AND LOW ENERGY. by Xiuyuan Bi Master of Science, New York University, 2010

Phase Change Memory An Architecture and Systems Perspective

Constructing Large and Fast On-Chip Cache for Mobile Processors with Multilevel Cell STT-MRAM Technology

Memory technology and optimizations ( 2.3) Main Memory

Improving DRAM Performance by Parallelizing Refreshes with Accesses

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture

A Low-Power Hybrid Magnetic Cache Architecture Exploiting Narrow-Width Values

Lecture: DRAM Main Memory. Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3)

Hybrid Cache Architecture (HCA) with Disparate Memory Technologies

Cache/Memory Optimization. - Krishna Parthaje

Cache Revive: Architecting Volatile STT-RAM Caches for Enhanced Performance in CMPs

Developing a Prototyping Board for Emerging Memory

Unleashing the Power of Embedded DRAM

CONTENT-AWARE SPIN-TRANSFER TORQUE MAGNETORESISTIVE RANDOM-ACCESS MEMORY (STT-MRAM) CACHE DESIGNS

Speeding Up Crossbar Resistive Memory by Exploiting In-memory Data Patterns

Novel Nonvolatile Memory Hierarchies to Realize "Normally-Off Mobile Processors" ASP-DAC 2014

Lecture: DRAM Main Memory. Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3)

Virtualized and Flexible ECC for Main Memory

ABSTRACT. Mu-Tien Chang Doctor of Philosophy, 2013

Computer Architecture: Main Memory (Part II) Prof. Onur Mutlu Carnegie Mellon University

Performance Enhancement Guaranteed Cache Using STT-RAM Technology

COMPUTER ARCHITECTURES

Spring 2018 :: CSE 502. Main Memory & DRAM. Nima Honarmand

Designing Giga-scale Memory Systems with STT-RAM

Test and Reliability of Emerging Non-Volatile Memories

ECE 341. Lecture # 16

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems

I/O CANNOT BE IGNORED

SOLVING MANUFACTURING CHALLENGES AND BRINGING SPIN TORQUE MRAM TO THE MAINSTREAM

A Brief Compendium of On Chip Memory Highlighting the Tradeoffs Implementing SRAM,

Evaluation of Hybrid Memory Technologies using SOT-MRAM for On-Chip Cache Hierarchy

Steven Geiger Jackson Lamp

Maximize energy efficiency in a normally-off system using NVRAM. Stéphane Gros Yeter Akgul

Reduce Your System Power Consumption with Altera FPGAs Altera Corporation Public

Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture

MRAM Developer Day 2018 MRAM Update

Spin-Hall Effect MRAM Based Cache Memory: A Feasibility Study

Technology in Action

Lecture-14 (Memory Hierarchy) CS422-Spring

Title: Redesigning Pipeline When Architecting STT-RAM as Registers in Rad-Hard Environment

Lecture 14: Cache Innovations and DRAM. Today: cache access basics and innovations, DRAM (Sections )

Lecture 5: Refresh, Chipkill. Topics: refresh basics and innovations, error correction

An Architecture-level Cache Simulation Framework Supporting Advanced PMA STT-MRAM

STT-RAM has recently attracted much attention, and it is

Computer Architecture Memory hierarchies and caches

Embedded Memories. Advanced Digital IC Design. What is this about? Presentation Overview. Why is this important? Jingou Lai Sina Borhani

Addressing the Memory Wall

Mark Redekopp, All rights reserved. EE 352 Unit 10. Memory System Overview SRAM vs. DRAM DMA & Endian-ness

Architecture for Carbon Nanotube Based Memory (NRAM)

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

A Power and Temperature Aware DRAM Architecture

An introduction to SDRAM and memory controllers. 5kk73

Lecture 15: DRAM Main Memory Systems. Today: DRAM basics and innovations (Section 2.3)

IMME256M64D2SOD8AG (Die Revision E) 2GByte (256M x 64 Bit)

EE414 Embedded Systems Ch 5. Memory Part 2/2

IMME256M64D2DUD8AG (Die Revision E) 2GByte (256M x 64 Bit)

Efficient STT-RAM Last-Level-Cache Architecture to replace DRAM Cache

Evalua&ng STT- RAM as an Energy- Efficient Main Memory Alterna&ve

arxiv: v2 [cs.ar] 15 May 2017

15-740/ Computer Architecture Lecture 19: Main Memory. Prof. Onur Mutlu Carnegie Mellon University

Nonblocking Memory Refresh. Kate Nguyen, Kehan Lyu, Xianze Meng, Vilas Sridharan, Xun Jian

CS311 Lecture 21: SRAM/DRAM/FLASH

IEEE TRANSACTIONS ON COMPUTERS, VOL. XX, NO. X, AUGUST TA-LRW: A Replacement Policy for Error Rate Reduction in STT-MRAM Caches

A Coherent Hybrid SRAM and STT-RAM L1 Cache Architecture for Shared Memory Multicores

Baoping Wang School of software, Nanyang Normal University, Nanyang , Henan, China

COMP2121: Microprocessors and Interfacing. Introduction to Microprocessors

Memory Systems IRAM. Principle of IRAM

CSEE W4824 Computer Architecture Fall 2012

a) Memory management unit b) CPU c) PCI d) None of the mentioned

Lecture: Memory, Multiprocessors. Topics: wrap-up of memory systems, intro to multiprocessors and multi-threaded programming models

k -bit address bus n-bit data bus Control lines ( R W, MFC, etc.)

,e-pg PATHSHALA- Computer Science Computer Architecture Module 25 Memory Hierarchy Design - Basics

An Efficient STT-RAM Last Level Cache Architecture for GPUs

AOS: Adaptive Overwrite Scheme for Energy-Efficient MLC STT-RAM Cache

Transcription:

Couture: Tailoring STT-MRAM for Persistent Main Memory Mustafa M Shihab Jie Zhang Shuwen Gao Joseph Callenes-Sloan Myoungsoo Jung

Executive Summary Motivation: DRAM plays an instrumental role in modern computers, serving as the exclusive main memory technology. But process scaling has exposed it to high leakage and refresh power consumption. STT-MRAM can be an excellent replacement for DRAM, given its high endurance and near-zero leakage. Couture USENIX Inflow 2016 2

Executive Summary Motivation: DRAM plays an instrumental role in modern computers, serving as the exclusive main memory technology. But process scaling has exposed it to high leakage and refresh power consumption. STT-MRAM can be an excellent replacement for DRAM, given its high endurance and near-zero leakage. Challenge: Conventional STT-MRAM cannot directly substitute DRAM, because of - Large cell area - High write energy Couture USENIX Inflow 2016 3

Executive Summary Motivation: DRAM plays an instrumental role in modern computers, serving as the exclusive main memory technology. But process scaling has exposed it to high leakage and refresh power consumption. STT-MRAM can be an excellent replacement for DRAM, given its high endurance and near-zero leakage. Challenge: Conventional STT-MRAM cannot directly substitute DRAM, because of - Large cell area - High write energy Solution: We propose, Couture, a tailored STT-MRAM based memory that offers - DRAM-comparable storage density - High performance with low write energy - Intelligent data scrubbing (iscrub) to ensure data integrity with minimum overhead Couture USENIX Inflow 2016 4

Executive Summary Motivation: DRAM plays an instrumental role in modern computers, serving as the exclusive main memory technology. But process scaling has exposed it to high leakage and refresh power consumption. STT-MRAM can be an excellent replacement for DRAM, given its high endurance and nearzero leakage. Challenge: Conventional STT-MRAM cannot directly substitute DRAM, because of - Large cell area - High write energy Solution: We propose, Couture, a tailored STT-MRAM based memory that offers - DRAM-comparable storage density - High performance with low write energy - Intelligent data scrubbing method (iscrub) to ensure data integrity with minimum overhead Results: Compared to a contemporary DRAM, our proposed Couture can - Achieve up to 23% performance improvement - Consume 18% less energy, on average Couture USENIX Inflow 2016 5

Main Memory DRAM Cache: Fastest Most Expensive Smallest Main Memory Storage: Slowest Least Expensive Largest Main memory ensures optimal performance-cost balance by bridging the gap between on-chip cache and storage DRAM has been the ubiquitous choice for main memory - most often incarnated as DIMMs Applications are getting increasingly data-intensive, demanding larger memory. So, DRAM has to scale-down to process technologies that can hurt its performance Scaling down DRAM cells increases off-state leakage Leakage results in frequent refresh operations that burden DRAM with extra latency and power overhead. Particularly critical for HPCs and data-centers with TBs of memory Couture USENIX Inflow 2016 6

STT-MRAM: Opportunity and Challenges Spin Transfer Torque Magnetoresistive RAM (STT-MRAM) stores data in a magnetic tunnel junction (MTJ) core Physical configuration of the MTJ determines data type i.e., 0 or 1. Data can only be written by driving current to change MTJ configuration. This is both good and bad! In absence of write current there is no switching. This means there is no scope for off-state leakage X Physical switching of the MTJ requires a large current. This makes the write energy high X Large write current requires a large access transistor to drive it, increasing the cell area. A typical STTMRAM cell ( 40F 2 ) is roughly 6X larger than a DRAM cell ( 6F 2 ) Couture USENIX Inflow 2016 7

Tailored STT-MRAM: Critical Design Parameters Couture proposes to reduce the STT-MRAM cell area and write energy by exploiting the design parameters for the MTJ core Thermal Stability Factor ( ): Refers to the stability of an MTJ s magnetic orientations E b = energy barrier, T = temperature, H K = anisotropic field, M S = saturation magnetization, k B = Boltzmann constant, and V = MTJ s volume Critical current (I C ): Minimum current for switching the polarity of MTJ s free layer and = fitting constants that represent the operational environment Retention time (T Retention ): The expected time before a random bit-flip occurs f 0 is the operating frequency Couture USENIX Inflow 2016 8

Access Transistor Optimization STT-MRAM cell area is dominated by the access transistor, as MTJ is comparatively smaller in size The transistor size is mainly determined by the current driven by it. So, to make the transistor smaller, we need to reduce the critical current We reduce the MTJ thickness to lower the thermal stability factor, which in turn lowers the critical current As we lowered the thermal stability factor from 40.29 to 28.91 by reducing MTJ volume, the cell area is reduced from 36 F 2 to 10 F 2 Unfortunately, lowering the thermal stability factor also shortens the retention time of our tailored STT-MRAM Couture USENIX Inflow 2016 9

Write Energy Optimization Lowering the critical current allows us to lower the STT-MRAM write current. This in turn can reduce the write power by 50% The lowered write current results in a marginally increased the write latency But, reducing the write current can improve overall write energy by 33% Lowering the thermal stability factor from its default value of 40.3 to 28.9, STT-MRAM s write energy decreases from 0.66 pj to 0.44 pj Couture USENIX Inflow 2016 10

Couture Memory Module Rank 2 Rank 1 Global Address Decoder Local Adrs Dec Local Adrs Dec For a smooth transition from DRAM, our proposed Couture fits in the existing main memory DIMM packaging Couture Chip ECC Global Address Bus On-Chip Control Logic DDR Channel Global Data Bus Banks Subarray n Local Row Buffer WL Local Row Buffer Subarray 1 Local Row Buffer Row Buffer Sensing Circuit Global Row Buffer 2 Ranks in the Module >> 8 memory Chips (+1 ECC Chip) per Rank 8 logical Banks per Rank >> Each Bank is divided into subarrays of Cells Each Cell contains an Access Transistor and an MTJ, connected via a bit-line (BL), a source-line (SL), and a word-line (WL) Couture USENIX Inflow 2016 11

Dual Write Driver and Sensing Circuit STT-MRAM requires opposing current paths for writing 0 and 1, as the MTJ needs to be switched from anti-parallel to parallel configuration Couture connects the cells to two write-input drivers - W0_En and W1_En Couture s sensing circuit detects the content of a cell by comparing it with a reference cell that has an MTJ that is set at a resistance level exactly in the middle of the resistance spectrum: R L+ R H 2 Couture USENIX Inflow 2016 12

Intelligent Data Scrub (iscrub) Cell-level optimization reduces the data retention time in tailored STT-MRAM But, data retention in STT-MRAM is actually probabilistic in nature Even with a fixed retention time, some of the cells can retain data for a longer time, while others can lose it before the set period iscrub a reinforcement learning based scrub scheduler that can exploit such probabilistic behavior and ensure data integrity with minimum overhead Couture Memory Actions A(t+1): Read, Write, Scrub Rewards R(t): Immediate: Maintain BER Long-term: Minimize Scrubbing iscrub Scheduler States S(t): Scrub freq, BER, Time Reinforcement learning: learns and improve with experience Interacts with its environment over time and senses the current state (S) of its environment and executes an action (A) that produces a reward (R) Goal is to maximize the cumulative reward by learning an optimal policy for mapping states to actions and gradually update a state-action-reward table Couture USENIX Inflow 2016 13

iscrub: Scheduling Algorithm Access next page for scrubbing Select page for scrubbing Access corresponding bank Perform scrub operation More pages to scrub in bank? Yes No No I/O scheduled for bank? Yes Release bank for one I/O Couture USENIX Inflow 2016 14

Evaluation Setup Simulated configurations: DDR3 DRAM: DRAM memory with periodic refresh operations STT-MRAM: Main memory design with conventional STTMRAM (10yr retention) Couture: Proposed Couture design without iscrub scheduler Couture-i: Optimal configuration for our Couture design with iscrub scheme Couture USENIX Inflow 2016 15

Performance Analysis - IPC Instructions per cycle (IPC) for the 4 memory configurations, normalized to that of the DDR3 DRAM baseline STT-MRAM eliminated refresh but average IPC fell below the baseline by 17% Because of the long write latency Lowering the write latency, Couture improved IPC by 8%, on average With iscrub, Couture-i improved the average IPC by 16% over DDR3 DRAM, peaking at 23% for the bzip2 and hmmer benchmarks Couture USENIX Inflow 2016 16

Performance Analysis - Throughput Second performance comparison in terms of I/O throughput STT-MRAM fell short of the baseline DDR3 DRAM by 5%, on average Couture exceeded the baseline s throughput by 8%, on average Couture-i performed best with an average improvement of 13% over the DDR3 DRAM baseline Couture USENIX Inflow 2016 17

Energy Improvement Energy consumption results: Normalized to the DDR3 DRAM baseline 5 components: standby, activation, read, write, and refresh DRAM consumed a significant energy for standby (leakage) and refresh STT-MRAM eliminated refresh and reduced standby energy, but the high write energy increases the overall consumption by 9.5% Couture, with optimized write current, shows an average improvement of 14% Couture-i reduced scrub energy with iscrub and further lowered total energy Peak reduction: 20% (for bzip2), Average reduction: 18% Couture USENIX Inflow 2016 18

Thank You

Backup Slides

iscrub Algorithm Couture USENIX Inflow 2016 21

Evaluation Process We build Couture latency model by considering cell-level and peripheral circuit latency We calibrated CACTI to get reasonable peripheral latency of main memory. For STT-MRAM s subarray access latency, we collect the data by modifying NVSim For system-level evaluation, we integrate Couture latency model in the gem5 We verify the performance of our Couture with ten workload applications from the SPEC2006 benchmark suite. Couture USENIX Inflow 2016 22