Emerging NVM Memory Technologies

Yuan Xie, Associate Professor
Department of Computer Science & Engineering, The Pennsylvania State University
www.cse.psu.edu/~yuanxie
yuanxie@cse.psu.edu

Position Statement
Emerging NVM technologies are very attractive: they combine the speed of SRAM, the density of DRAM, and the non-volatility of Flash memory.
Attractive features: high density, low leakage, non-volatility.
Undesirable features:
  Write-related: long write latency, high write energy, low endurance (e.g., PCRAM).
  Cost: needs large-volume production.
Solution: hybrid cache/memory/storage + 3D?
Enabling unique applications.

Outline
Introduction
Modeling: MRAM/PCRAM modeling
Architecture: MRAM stacking; HCA: Hybrid Cache Architecture; Hybrid storage system
Application: Exascale computing
Conclusion

Traditional Memory Hierarchies

Level                              Latency (cycles)
On-chip memory (SRAM)              1~30
Off-chip memory (DRAM)             100~300
Solid State Disk (Flash Memory)    25000~2000000
Secondary Storage (HDD)            >5000000

Figure annotation: large latency gap.

Emerging Memory Technologies
FeRAM (Ferroelectric RAM): Toshiba FeRAM (2009)
MRAM (Magnetic RAM): EverSpin MRAM (2008)
Memristor (Resistive RAM): HP Labs Memristor (2009)
PCRAM (Phase-Change RAM): Samsung PCRAM (2008)


NVRAM Comparison
Emerging NVRAM (FeRAM, MRAM, or PCRAM) combines the advantages of SRAM, DRAM, and flash.
This is a good opportunity to rethink memory hierarchy design.
Courtesy: Motoyuki Ooishi

What is the impact of emerging NVM technologies on computer memory hierarchies?

Traditional Memory Hierarchies

Level                        Latency (cycles)
On-chip memory (SRAM)        ~10
Off-chip memory (DRAM)       ~100
Solid State Disk (SSD)       25000~2000000
Secondary Storage (HDD)      >5000000

Emerging Non-volatile Memory (NVM): Phase-change RAM (PCRAM), Magnetic RAM (MRAM)

PCRAMsim Model
Developed on the basis of CACTI.
CACTI models SRAM and DRAM caches; CACTI does NOT support PCRAM.
PCRAMsim made 3 modifications at the subarray level.
[Figure: CACTI-modeled memory subarray. A 2D array of memory cells with peripheral circuitry: precharge & equalization, row decoders, wordline drivers, bitline mux, sense amplifiers, sense amplifier mux, output/write drivers.]

SRAM vs. MRAM
MRAM: high density, fast read, slow write, low read energy, high write energy, low leakage.

                  SRAM         MRAM
Area (65nm)       3.66 mm^2    3.30 mm^2
Capacity          128KB        512KB
Read latency      2.25ns       2.32ns
Write latency     2.26ns       11.02ns
Read energy       0.90nJ       0.86nJ
Write energy      0.80nJ       5.00nJ

Cache configuration            Leakage power
2MB (16x128KB) SRAM cache      2.09W
8MB (16x512KB) MRAM cache      0.26W

Pros: low leakage power, high density.
Cons: long write latency and large write energy.
Replace SRAM caches with MRAM?
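To make the read/write asymmetry in the table concrete, the following minimal sketch computes the mean latency and dynamic energy per access as a function of the write fraction, and the write fraction at which MRAM stops saving dynamic energy. The per-access figures are copied from the table; the weighted-mean model, the function names, and the dictionary layout are illustrative assumptions, and leakage is deliberately left out.

```python
# Per-access figures copied from the SRAM vs. MRAM table (65nm).
SRAM = {"read_ns": 2.25, "write_ns": 2.26, "read_nj": 0.90, "write_nj": 0.80}
MRAM = {"read_ns": 2.32, "write_ns": 11.02, "read_nj": 0.86, "write_nj": 5.00}


def mean_per_access(cell, write_fraction, unit):
    """Weighted mean of read and write cost for one access.

    Assumption: accesses are independent and a fixed fraction are writes.
    `unit` is "ns" (latency) or "nj" (dynamic energy).
    """
    read_cost = cell[f"read_{unit}"]
    write_cost = cell[f"write_{unit}"]
    return (1.0 - write_fraction) * read_cost + write_fraction * write_cost


def energy_break_even_write_fraction(a, b):
    """Write fraction at which technologies a and b use equal mean energy.

    Solves (1-w)*ra + w*wa = (1-w)*rb + w*wb for w.
    """
    read_gain = a["read_nj"] - b["read_nj"]      # energy b saves per read
    write_cost = b["write_nj"] - a["write_nj"]   # extra energy b pays per write
    return read_gain / (read_gain + write_cost)


if __name__ == "__main__":
    w = energy_break_even_write_fraction(SRAM, MRAM)
    print(f"break-even write fraction: {w:.4f}")
    for frac in (0.0, 0.1, 0.5):
        print(frac,
              round(mean_per_access(SRAM, frac, "ns"), 2),
              round(mean_per_access(MRAM, frac, "ns"), 2))
```

Under this simple model, MRAM's small read-energy advantage is cancelled once roughly 1% of accesses are writes, so any dynamic-energy benefit disappears quickly on write-heavy workloads, while the leakage advantage (2.09W vs. 0.26W) is unaffected.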

Direct Replacement
Replace SRAM with MRAM of the same area.
The number of banks is kept the same.
The capacity of the L2 cache increases by 4X.
[Figure: L2 cache miss rate]
The L2 cache miss rate is reduced. How is the performance?

IPC Comparison (Direct Replacement)
[Figure: IPC (SRAM vs. MRAM)]
The last four benchmarks have high write intensities (see Observation 1).

Observation 1
Replacing SRAM L2 caches directly with MRAM can reduce the miss rate of the L2 caches. However, the long access latency of the MRAM cache has a negative impact on performance. When the write intensity is high, it even results in performance degradation.
Direct MRAM replacement may harm performance. How about power consumption?

Power Analysis (Direct Replacement)
[Figure: Total power (SRAM vs. MRAM), normalized to 2M-SRAM-SNUCA, showing MRAM dynamic power and MRAM leakage power]
For some workloads, MRAM dynamic power dominates! (see Observation 2)

Observation 2
Replacing SRAM L2 caches directly with MRAM can greatly reduce the leakage power. When the write intensity is high, the dynamic power increases significantly because of the high write energy of the MRAM cache.
Question: how can the performance of MRAM caches be improved and their power further reduced?

SRAM-MRAM Hybrid L2 Cache
[Figure: Write intensity (pure vs. hybrid)]
Using a hybrid L2 cache, MRAM write intensities are reduced.

IPC Result
[Figure: IPC comparison; legend includes "direct replacement" and "with read-preemptive"]
The performance degradation is eliminated. The average IPC is increased by 15%.

Power Result
[Figure: Total power comparison; legend includes "8M-MRAM-DNUCA", "direct replacement", and "with read-preemptive"]
The dynamic power is reduced. The average total power is further reduced by 17%.

Comparisons

                   SRAM        eDRAM      MRAM                             PRAM
Density (ratio)    Low (1)     High (4)   High (4)                         High (16)
Dynamic power      Low         Medium     Low for read; high for write     Medium for read; high for write
Leakage power      High        Medium     Low                              Low
Speed              Very fast   Fast       Fast for read; slow for write    Slow for read; very slow for write
Non-volatility     No          No         Yes                              Yes
Scalability        Yes         Yes        Yes                              Yes
Endurance          10^16       10^16      >10^15                           10^8

Callouts on the slide: reduce cache miss rate, increase hit latency; low leakage power, high dynamic power.

No Such Ideal (One-Size-Fits-All) Memory
[Figure: Normalized IPC for 1M-SRAM, 4M-DRAM, 4M-MRAM, and 16M-PRAM across benchmarks: astar, bzip2, gcc, gobmk, h264, hmmer-sp, libquantum, mcf, omnetpp, perl, sjeng, blast, bt, cg, clustalw, hmmer, lu, mg, sp, ua, specjbb, dedup, fluidanimate, freqmine, streamcluster, Geomean]
[Figure: Normalized power, split into static and dynamic, for the same benchmarks]
No single memory technology has the best power-performance.
A hybrid cache may outperform its single-technology counterpart.

HCA: Hybrid Cache Architecture
Baseline: a 2D 8-core CMP (3-level SRAM caches).
[Figure: hybrid cache design scenarios, labeled A through E.
  2D design scenario (LHCA, RHCA): cores with L1s, L2 (SRAM), and L3 (eDRAM/MRAM/PRAM); flattening L2 and L3 with a hybrid cache gives L2 fast (SRAM) and L2 slow (eDRAM/MRAM/PRAM).
  3D design scenario (3DHCA), a cache design scenario with 3D chip integration: L2 (SRAM), L3 (eDRAM/MRAM), and L4 (PRAM) across 3D layers 1 and 2; flattening L2, L3 and L4 with a hybrid cache; flattening L3 and L4 with a hybrid cache.]

Hybrid Storage (HPCA 2010)
Hybrid architecture: data region in NAND flash, log region in PRAM, data buffer in memory.
[Figure: physical view and structural view of the hybrid architecture; labels include erase unit, sector (512 bytes), and in-place updating]
How to manage the log region efficiently?


Fault Resiliency for Exascale System
Microprocessors become unreliable: process scaling, voltage scaling, soft errors, NBTI, ...
Even assuming socket MTTF remains constant:
  system MTTF = socket MTTF / number of sockets
  1 socket: socket MTTF = 5 years
  Exascale, ~100,000 sockets: system MTTF = 26 minutes
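The 26-minute figure follows directly from the formula on the slide. A minimal sketch of the arithmetic, assuming a 365.25-day year (the function and constant names are illustrative):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumption: average year length


def system_mttf_minutes(socket_mttf_years, num_sockets):
    """system MTTF = socket MTTF / number of sockets, expressed in minutes."""
    return socket_mttf_years * MINUTES_PER_YEAR / num_sockets


if __name__ == "__main__":
    # Slide values: socket MTTF = 5 years, ~100,000 sockets at exascale.
    print(f"{system_mttf_minutes(5, 100_000):.1f} minutes")
```

With the slide's values this gives a little over 26 minutes, matching the stated system MTTF.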

Checkpoint / Restart
Checkpoint/restart is the state of the art.
Hard disk drive (HDD) as the checkpoint storage; HDD peak bandwidth: ~100MB/s.
BlueGene/L: 12 mins to take a checkpoint, equivalent to 8% performance loss. Tolerable.
Scale to exascale... Unacceptable!
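A rough way to see why the same checkpoint cost becomes unacceptable at exascale is sketched below. It assumes a deliberately simple model in which performance loss equals checkpoint time divided by the checkpoint period; that model, and the function names, are assumptions for illustration and are not taken from the slides. The 26-minute system MTTF is the exascale figure from the Fault Resiliency slide.

```python
def implied_checkpoint_period_min(checkpoint_min, loss_fraction):
    """Checkpoint period implied by a given loss, assuming
    loss = checkpoint time / checkpoint period (a simplifying assumption)."""
    return checkpoint_min / loss_fraction


def checkpoint_share_of_mttf(checkpoint_min, system_mttf_min):
    """Fraction of the mean time to failure consumed by one checkpoint."""
    return checkpoint_min / system_mttf_min


if __name__ == "__main__":
    # BlueGene/L figures from the slide: 12-minute checkpoint, 8% loss.
    print(implied_checkpoint_period_min(12, 0.08))   # minutes between checkpoints
    # Exascale system MTTF of 26 minutes from the Fault Resiliency slide.
    print(f"{checkpoint_share_of_mttf(12, 26):.0%} of MTTF per checkpoint")
```

Under this assumption, an 8% loss corresponds to checkpointing about every 150 minutes, whereas on a system that fails every 26 minutes a single 12-minute checkpoint would already consume nearly half of the time between failures.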

PCRAM: A Good Candidate
PCRAM is 2 orders of magnitude faster than flash.
PCRAM has 3 orders of magnitude higher endurance than flash.
Good candidate for local checkpointing.

                 HDD      NAND Flash    PCRAM
Cell size        -        4-6F^2        4-6F^2
Read time        ~4ms     5us-50us      10ns-100ns
Write time       ~4ms     2ms-3ms       100-1000ns
Standby power    ~1W      ~0W           ~0W
Endurance        10^15    10^5          10^8

Courtesy: Motoyuki Ooishi

3D PCRAM: How to Integrate PCRAM
Deploy PCRAM directly on top of DRAM.
Possible local bandwidth: ~2.5TB/s (DIMM bandwidth: ~10GB/s).
[Figure: PCRAM stacked on DRAM]

Parameters                             Values
Bank size                              32MB
Mat count                              16
Required TSV pitch                     < 74um
ITRS TSV pitch projection for 2012     3.8um
3D-PCRAM delay                         0.8ms
Equivalent bandwidth                   2500GB/s

(Collaboration with HP Labs, Exascale Computing Lab, Dr. Norm Jouppi, SC 2009)
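The gap between the two bandwidth figures can be illustrated with a short calculation. This sketch assumes that "equivalent bandwidth" means data size divided by the 3D-PCRAM delay, so the data size it derives (delay times bandwidth) is an inference rather than a number stated on the slide; the names are illustrative.

```python
DIMM_GB_PER_S = 10.0        # DIMM bandwidth from the slide (~10GB/s)
LOCAL_3D_GB_PER_S = 2500.0  # equivalent 3D-PCRAM bandwidth from the slide
PCRAM_DELAY_S = 0.8e-3      # 3D-PCRAM delay from the slide (0.8ms)


def transfer_time_s(size_gb, bandwidth_gb_per_s):
    """Time to move size_gb at a sustained bandwidth."""
    return size_gb / bandwidth_gb_per_s


def implied_size_gb(delay_s, bandwidth_gb_per_s):
    """Data size implied by a delay and a bandwidth (assumes size = delay * bandwidth)."""
    return delay_s * bandwidth_gb_per_s


if __name__ == "__main__":
    size = implied_size_gb(PCRAM_DELAY_S, LOCAL_3D_GB_PER_S)
    print(f"implied data size: {size:.1f} GB")
    print(f"over DIMM interface: {transfer_time_s(size, DIMM_GB_PER_S) * 1e3:.0f} ms")
    print(f"speedup: {LOCAL_3D_GB_PER_S / DIMM_GB_PER_S:.0f}x")
```

On these figures the stacked path is 250 times faster than a DIMM-bandwidth path for the same amount of data.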

Our Projection
[Figure: projection for 2008 through 2017]
(Collaboration with HP Labs, Exascale Computing Lab, Dr. Norm Jouppi, SC 2009)

More Details
http://www.cse.psu.edu/~yuanxie/3d.html
Xiangyu Dong, X. Wu, Guangyu Sun, Yuan Xie, H. Li, Y. Chen. Circuit and Microarchitecture Evaluation of 3D MRAM. DAC 2008.
Xiangyu Dong, Norm Jouppi, Yuan Xie. PCRAMsim: System-Level Performance, Energy, and Area Modeling for Phase-Change RAM. ICCAD 2009.
G. Sun, X. Dong, Y. Xie, J. Li, Y. Chen. Novel MRAM-Stacking Architecture for CMP. HPCA 2009.
Xiaoxia Wu, J. Li, L. Zhang, E. Speight, Yuan Xie. Hybrid Cache Architecture with Disparate Memory Technologies. ISCA 2009.
Guangyu Sun, Y. Joo, Y. Chen, Yuan Xie, Y. Chen, H. Li. A Hybrid Solid-State Storage Architecture for Performance, Energy Consumption and Lifetime Improvement. HPCA 2010.
Y. Joo, D. Niu, Guangyu Sun, Xiangyu Dong, Y. Xie. Energy- and Endurance-Aware Design of PCRAM Caches. DATE 2010.
Xiangyu Dong, N. Muralimanohar, Norm Jouppi, Richard Kaufmann, Yuan Xie. Leveraging 3D PCRAM Technologies to Reduce Checkpoint Overhead for Future Exascale Systems. SC 2009.

Conclusion
Emerging NVM technologies are very attractive: they combine the speed of SRAM, the density of DRAM, and the non-volatility of Flash memory.
Attractive features: high density, low leakage, non-volatility.
Undesirable features:
  Write-related: long write latency, high write energy, low endurance (e.g., PCRAM).
  Cost: needs large-volume production.
Solution: hybrid cache/memory/storage + 3D?
Enabling unique applications.