Opportunities and Challenges of Emerging Memory Technologies

Size: px

Start display at page:

Download "Opportunities and Challenges of Emerging Memory Technologies"

Adrian McDonald
5 years ago
Views:

1 Opportuities ad Challeges of Emergig Memory Techologies Our Mutlu September 11, 2017 ARM Research Summit

2 The Mai Memory System Processors ad caches Mai Memory Storage (SSD/HDD) Mai memory is a critical compoet of all computig systems: server, mobile, embedded, desktop, sesor Mai memory system must scale (i size, techology, efficiecy, cost, ad maagemet algorithms) to maitai performace growth ad techology scalig beefits 2

3 The Mai Memory System FPGAs Mai Memory Storage (SSD/HDD) Mai memory is a critical compoet of all computig systems: server, mobile, embedded, desktop, sesor Mai memory system must scale (i size, techology, efficiecy, cost, ad maagemet algorithms) to maitai performace growth ad techology scalig beefits 3

4 The Mai Memory System GPUs Mai Memory Storage (SSD/HDD) Mai memory is a critical compoet of all computig systems: server, mobile, embedded, desktop, sesor Mai memory system must scale (i size, techology, efficiecy, cost, ad maagemet algorithms) to maitai performace growth ad techology scalig beefits 4

5 Memory System: A Shared Resource View Storage Most of the system is dedicated to storig ad movig data 5

6 State of the Mai Memory System Recet techology, architecture, ad applicatio treds lead to ew reuiremets exacerbate old reuiremets DRAM ad memory cotrollers, as we kow them today, are (will be) ulikely to satisfy all reuiremets Some emergig o-volatile memory techologies (e.g., PCM) eable ew opportuities: memory+storage mergig We eed to rethik the mai memory system to fix DRAM issues ad eable emergig techologies to satisfy all reuiremets 6

7 Major Treds Affectig Mai Memory (I) Need for mai memory capacity, badwidth, QoS icreasig Mai memory eergy/power is a key system desig cocer DRAM techology scalig is edig 7

8 Major Treds Affectig Mai Memory (II) Need for mai memory capacity, badwidth, QoS icreasig Multi-core: icreasig umber of cores/agets Data-itesive applicatios: icreasig demad/huger for data Cosolidatio: cloud computig, GPUs, mobile, heterogeeity Mai memory eergy/power is a key system desig cocer DRAM techology scalig is edig 8

9 Example: The Memory Capacity Gap Core cout doublig ~ every 2 years DRAM DIMM capacity doublig ~ every 3 years Lim et al., ISCA 2009 Memory capacity per core expected to drop by 30% every two years Treds worse for memory badwidth per core! 9

10 Major Treds Affectig Mai Memory (III) Need for mai memory capacity, badwidth, QoS icreasig Mai memory eergy/power is a key system desig cocer ~40-50% eergy spet i off-chip memory hierarchy [Lefurgy, IEEE Computer 03] >40% power i DRAM [Ware, HPCA 10][Paul,ISCA 15] DRAM cosumes power eve whe ot used (periodic refresh) DRAM techology scalig is edig 10

11 Major Treds Affectig Mai Memory (IV) Need for mai memory capacity, badwidth, QoS icreasig Mai memory eergy/power is a key system desig cocer DRAM techology scalig is edig ITRS projects DRAM will ot scale easily below X m Scalig has provided may beefits: higher capacity (desity), lower cost, lower eergy 11

12 Major Treds Affectig Mai Memory (V) DRAM scalig has already become icreasigly difficult Icreasig cell leakage curret, reduced cell reliability, icreasig maufacturig difficulties [Kim+ ISCA 2014], [Liu+ ISCA 2013], [Mutlu IMW 2013], [Mutlu DATE 2017] Difficult to sigificatly improve capacity, eergy Emergig memory techologies are promisig 3D-Stacked DRAM higher badwidth smaller capacity Reduced-Latecy DRAM (e.g., RLDRAM, TL-DRAM) Low-Power DRAM (e.g., LPDDR3, LPDDR4) No-Volatile Memory (NVM) (e.g., PCM, STTRAM, ReRAM, 3D Xpoit) lower latecy lower power larger capacity higher cost higher latecy higher cost higher latecy higher dyamic power lower edurace 12

13 Major Treds Affectig Mai Memory (V) DRAM scalig has already become icreasigly difficult Icreasig cell leakage curret, reduced cell reliability, icreasig maufacturig difficulties [Kim+ ISCA 2014], [Liu+ ISCA 2013], [Mutlu IMW 2013], [Mutlu DATE 2017] Difficult to sigificatly improve capacity, eergy Emergig memory techologies are promisig 3D-Stacked DRAM higher badwidth smaller capacity Reduced-Latecy DRAM (e.g., RLDRAM, TL-DRAM) Low-Power DRAM (e.g., LPDDR3, LPDDR4) No-Volatile Memory (NVM) (e.g., PCM, STTRAM, ReRAM, 3D Xpoit) lower latecy lower power larger capacity higher cost higher latecy higher cost higher latecy higher dyamic power lower edurace 13

14 Ageda Major Treds Affectig Mai Memory The Memory Scalig Problem Two Complemetary Solutio Directios New Memory (DRAM) Architectures Eablig Emergig (NVM) Techologies Coclusio 14

15 Limits of Charge Memory Difficult charge placemet ad cotrol Flash: floatig gate charge DRAM: capacitor charge, trasistor leakage Reliable sesig becomes difficult as charge storage uit size reduces 15

16 The DRAM Scalig Problem DRAM stores charge i a capacitor (charge-based memory) Capacitor must be large eough for reliable sesig Access trasistor should be large eough for low leakage ad high retetio time Scalig beyod 40-35m (2013) is challegig [ITRS, 2009] DRAM capacity, cost, ad eergy/power hard to scale 16

17 As Memory Scales, It Becomes Ureliable Data from all of Facebook s servers worldwide Meza+, Revisitig Memory Errors i Large-Scale Productio Data Ceters, DSN

Large-Scale Failure Aalysis of DRAM Chips Aalysis ad modelig of memory errors foud i all of Facebook s server fleet Justi Meza, Qiag Wu, Sajeev Kumar, ad Our Mutlu, "Revisitig Memory Errors i

18 Large-Scale Failure Aalysis of DRAM Chips Aalysis ad modelig of memory errors foud i all of Facebook s server fleet Justi Meza, Qiag Wu, Sajeev Kumar, ad Our Mutlu, "Revisitig Memory Errors i Large-Scale Productio Data Ceters: Aalysis ad Modelig of New Treds from the Field" Proceedigs of the 45th Aual IEEE/IFIP Iteratioal Coferece o Depedable Systems ad Networks (DSN), Rio de Jaeiro, Brazil, Jue [Slides (pptx) (pdf)] [DRAM Error Model] 18

19 Ifrastructures to Uderstad Such Issues Temperature Cotroller FPGAs Heater FPGAs PC Kim+, Flippig Bits i Memory Without Accessig Them: A Experimetal Study of DRAM Disturbace Errors, ISCA

for Eablig Experimetal DRAM Studies, HPCA 2017.

20 SoftMC: Ope Source DRAM Ifrastructure Hasa Hassa et al., SoftMC: A Flexible ad Practical OpeSource Ifrastructure for Eablig Experimetal DRAM Studies, HPCA Flexible Easy to Use (C++ API) Ope-source github.com/cmu-safari/softmc 20

21 SoftMC 21

22 A Curious Discovery [Kim et al., ISCA 2014] Oe ca predictably iduce errors i most DRAM memory chips 22

23 DRAM RowHammer A simple hardware failure mechaism ca create a widespread system security vulerability 23

24 Moder DRAM is Proe to Disturbace Errors Row of Cells Row Victim Row Row Hammered Row Row Victim Row Row Opeed Closed Wordlie V LOW HIGH Repeatedly readig a row eough times (before memory gets refreshed) iduces disturbace errors i adjacet rows i most real DRAM chips you ca buy today Flippig Bits i Memory Without Accessig Them: A Experimetal Study of DRAM Disturbace Errors, (Kim et al., ISCA 2014) 24

More o RowHammer Aalysis Yoogu Kim, Ross Daly, Jeremie Kim, Chris Falli, Ji Hye Lee, Doghyuk Lee, Chris Wilkerso, Korad Lai, ad Our Mutlu, "Flippig Bits i Memory Without Accessig Them: A Experimetal

25 More o RowHammer Aalysis Yoogu Kim, Ross Daly, Jeremie Kim, Chris Falli, Ji Hye Lee, Doghyuk Lee, Chris Wilkerso, Korad Lai, ad Our Mutlu, "Flippig Bits i Memory Without Accessig Them: A Experimetal Study of DRAM Disturbace Errors" Proceedigs of the 41st Iteratioal Symposium o Computer Architecture (ISCA), Mieapolis, MN, Jue [Slides (pptx) (pdf)] [Lightig Sessio Slides (pptx) (pdf)] [Source Code ad Data] 25

26 Idustry Is Writig Papers About It, Too 26

27 Idustry Is Writig Papers About It, Too 27

28 How Do We Solve The Memory Problem? Fix it: Make memory ad cotrollers more itelliget New iterfaces, fuctios, architectures: system-mem codesig Elimiate or miimize it: Replace or (more likely) augmet DRAM with a differet techology New techologies ad (VM, system-wide OS, MM) rethikig of memory & storage Embrace it: Desig heterogeeous Logic memories (oe of which are perfect) ad map data itelligetly across them Problems Algorithms Programs Rutime System ISA Microarchitecture Devices User New models for data maagemet ad maybe usage Solutios (to memory scalig) reuire software/hardware/device cooperatio 28

29 Ageda Major Treds Affectig Mai Memory The Memory Scalig Problem Two Complemetary Solutio Directios New Memory (DRAM) Architectures Eablig Emergig (NVM) Techologies Coclusio 29

30 Solutio 1: New Memory Architectures Overcome memory shortcomigs with Memory-cetric system desig Novel memory architectures, iterfaces, fuctios Better waste maagemet (efficiet utilizatio) Key issues to tackle Eable reliability at low cost à high capacity Reduce eergy Improve latecy ad badwidth Reduce waste (capacity, badwidth, latecy) Eable computatio close to data 30

31 Solutio 1: New Memory Architectures Liu+, RAIDR: Retetio-Aware Itelliget DRAM Refresh, ISCA Kim+, A Case for Exploitig Subarray-Level Parallelism i DRAM, ISCA Lee+, Tiered-Latecy DRAM: A Low Latecy ad Low Cost DRAM Architecture, HPCA Liu+, A Experimetal Study of Data Retetio Behavior i Moder DRAM Devices, ISCA Seshadri+, RowCloe: Fast ad Efficiet I-DRAM Copy ad Iitializatio of Bulk Data, MICRO Pekhimeko+, Liearly Compressed Pages: A Mai Memory Compressio Framework, MICRO Chag+, Improvig DRAM Performace by Parallelizig Refreshes with Accesses, HPCA Kha+, The Efficacy of Error Mitigatio Techiues for DRAM Retetio Failures: A Comparative Experimetal Study, SIGMETRICS Luo+, Characterizig Applicatio Memory Error Vulerability to Optimize Data Ceter Cost, DSN Kim+, Flippig Bits i Memory Without Accessig Them: A Experimetal Study of DRAM Disturbace Errors, ISCA Lee+, Adaptive-Latecy DRAM: Optimizig DRAM Timig for the Commo-Case, HPCA Qureshi+, AVATAR: A Variable-Retetio-Time (VRT) Aware Refresh for DRAM Systems, DSN Meza+, Revisitig Memory Errors i Large-Scale Productio Data Ceters: Aalysis ad Modelig of New Treds from the Field, DSN Kim+, Ramulator: A Fast ad Extesible DRAM Simulator, IEEE CAL Seshadri+, Fast Bulk Bitwise AND ad OR i DRAM, IEEE CAL Ah+, A Scalable Processig-i-Memory Accelerator for Parallel Graph Processig, ISCA Ah+, PIM-Eabled Istructios: A Low-Overhead, Locality-Aware Processig-i-Memory Architecture, ISCA Lee+, Decoupled Direct Memory Access: Isolatig CPU ad IO Traffic by Leveragig a Dual-Data-Port DRAM, PACT Seshadri+, Gather-Scatter DRAM: I-DRAM Address Traslatio to Improve the Spatial Locality of No-uit Strided Accesses, MICRO Lee+, Simultaeous Multi-Layer Access: Improvig 3D-Stacked Memory Badwidth at Low Cost, TACO Hassa+, ChargeCache: Reducig DRAM Latecy by Exploitig Row Access Locality, HPCA Chag+, Low-Cost Iter-Liked Subarrays (LISA): Eablig Fast Iter-Subarray Data Migratio i DRAM, HPCA Chag+, Uderstadig Latecy Variatio i Moder DRAM Chips Experimetal Characterizatio, Aalysis, ad Optimizatio, SIGMETRICS Kha+, PARBOR: A Efficiet System-Level Techiue to Detect Data Depedet Failures i DRAM, DSN Hsieh+, Trasparet Offloadig ad Mappig (TOM): Eablig Programmer-Trasparet Near-Data Processig i GPU Systems, ISCA Hashemi+, Acceleratig Depedet Cache Misses with a Ehaced Memory Cotroller, ISCA Boroumad+, LazyPIM: A Efficiet Cache Coherece Mechaism for Processig-i-Memory, IEEE CAL Pattaik+, Schedulig Techiues for GPU Architectures with Processig-I-Memory Capabilities, PACT Hsieh+, Acceleratig Poiter Chasig i 3D-Stacked Memory: Challeges, Mechaisms, Evaluatio, ICCD Hashemi+, Cotiuous Ruahead: Trasparet Hardware Acceleratio for Memory Itesive Workloads, MICRO Kha+, A Case for Memory Cotet-Based Detectio ad Mitigatio of Data-Depedet Failures i DRAM", IEEE CAL Hassa+, SoftMC: A Flexible ad Practical Ope-Source Ifrastructure for Eablig Experimetal DRAM Studies, HPCA Mutlu, The RowHammer Problem ad Other Issues We May Face as Memory Becomes Deser, DATE Lee+, Desig-Iduced Latecy Variatio i Moder DRAM Chips: Characterizatio, Aalysis, ad Latecy Reductio Mechaisms, SIGMETRICS Chag+, Uderstadig Reduced-Voltage Operatio i Moder DRAM Devices: Experimetal Characterizatio, Aalysis, ad Mechaisms, SIGMETRICS Patel+, The Reach Profiler (REAPER): Eablig the Mitigatio of DRAM Retetio Failures via Profilig at Aggressive Coditios, ISCA Seshadri ad Mutlu, Simple Operatios i Memory to Reduce Data Movemet, ADCOM Liu+, Cocurret Data Structures for Near-Memory Computig, SPAA Kha+, Detectig ad Mitigatig Data-Depedet DRAM Failures by Exploitig Curret Memory Cotet, MICRO Seshadri+, Ambit: I-Memory Accelerator for Bulk Bitwise Operatios Usig Commodity DRAM Techology, MICRO Avoid DRAM: Seshadri+, The Evicted-Address Filter: A Uified Mechaism to Address Both Cache Pollutio ad Thrashig, PACT Pekhimeko+, Base-Delta-Immediate Compressio: Practical Data Compressio for O-Chip Caches, PACT Seshadri+, The Dirty-Block Idex, ISCA Pekhimeko+, Exploitig Compressed Block Size as a Idicator of Future Reuse, HPCA Vijaykumar+, A Case for Core-Assisted Bottleeck Acceleratio i GPUs: Eablig Flexible Data Compressio with Assist Warps, ISCA Pekhimeko+, Toggle-Aware Badwidth Compressio for GPUs, HPCA

32 Example: (Truly) I-Memory Computatio We ca support i-dram COPY, ZERO, AND, OR, NOT, MAJ At low cost Usig aalog computatio capability of DRAM Idea: activatig multiple rows performs computatio 30-60X performace ad eergy improvemet Seshadri+, Ambit: I-Memory Accelerator for Bulk Bitwise Operatios Usig Commodity DRAM Techology, MICRO New memory techologies eable eve more opportuities Memristors, resistive RAM, phase chage mem, STT-MRAM, Ca operate o data with miimal movemet 32

33 Performace: I-DRAM Bitwise Operatios 33

34 Eergy of I-DRAM Bitwise Operatios Seshadri+, Ambit: I-Memory Accelerator for Bulk Bitwise Operatios usig Commodity DRAM Techology, MICRO

35 More o Ambit Vivek Seshadri et al., Ambit: I-Memory Accelerator for Bulk Bitwise Operatios Usig Commodity DRAM Techology, MICRO

Tesseract System for Graph Processig Itercoected set of

Processor Memory-Mapped Accelerator Iterface Nocacheable,

Core LP PF Buffer MTP Message Queue DRAM Cotroller NI Ah+, A

36 Tesseract System for Graph Processig Itercoected set of 3D-stacked memory+logic chips with simple cores Host Processor Memory-Mapped Accelerator Iterface Nocacheable, Physically Addressed) Memory Logic Crossbar Network I-Order Core LP PF Buffer MTP Message Queue DRAM Cotroller NI Ah+, A Scalable Processig-i-Memory Accelerator for Parallel Graph Processig ISCA 2015.

37 Tesseract Graph Processig Performace Speedup >13X Performace Improvemet O five graph processig algorithms 11.6x 9.0x +56% +25% 13.8x 0 DDR3-OoO HMC-OoO HMC-MC Tesseract Tesseract- LP Tesseract- LP-MTP Ah+, A Scalable Processig-i-Memory Accelerator for Parallel Graph Processig ISCA 2015.

38 Tesseract Graph Processig System Eergy 1.2 Memory Layers Logic Layers Cores HMC-OoO > 8X Eergy Reductio Tesseract with Prefetchig Ah+, A Scalable Processig-i-Memory Accelerator for Parallel Graph Processig ISCA 2015.

39 More o Tesseract Juwha Ah, Sugpack Hog, Sugjoo Yoo, Our Mutlu, ad Kiyoug Choi, "A Scalable Processig-i-Memory Accelerator for Parallel Graph Processig" Proceedigs of the 42d Iteratioal Symposium o Computer Architecture (ISCA), Portlad, OR, Jue [Slides (pdf)] [Lightig Sessio Slides (pdf)] 39

40 Acceleratig Poiter Chasig with PIM Kevi Hsieh, Samira Kha, Nadita Vijaykumar, Kevi K. Chag, Amirali Boroumad, Saugata Ghose, ad Our Mutlu, "Acceleratig Poiter Chasig i 3D-Stacked Memory: Challeges, Mechaisms, Evaluatio" Proceedigs of the 34th IEEE Iteratioal Coferece o Computer Desig (ICCD), Phoeix, AZ, USA, October

41 Ageda Major Treds Affectig Mai Memory The Memory Scalig Problem Two Complemetary Solutio Directios New Memory (DRAM) Architectures Eablig Emergig (NVM) Techologies Coclusio 41

42 Limits of Charge Memory Difficult charge placemet ad cotrol Flash: floatig gate charge DRAM: capacitor charge, trasistor leakage Reliable sesig becomes difficult as charge storage uit size reduces 42

43 Solutio 2: Emergig Memory Techologies Some emergig resistive memory techologies seem more scalable tha DRAM (ad they are o-volatile) Example: Phase Chage Memory Data stored by chagig phase of material Data read by detectig material s resistace Expected to scale to 9m (2022 [ITRS 2009]) Prototyped at 20m (Raoux+, IBM JRD 2008) Expected to be deser tha DRAM: ca store multiple bits/cell But, emergig techologies have (may) shortcomigs Ca they be eabled to replace/augmet/surpass DRAM? 43

44 Solutio 2: Emergig Memory Techologies Lee+, Architectig Phase Chage Memory as a Scalable DRAM Alterative, ISCA 09, CACM 10, IEEE Micro 10. Meza+, Eablig Efficiet ad Scalable Hybrid Memories, IEEE Comp. Arch. Letters Yoo, Meza+, Row Buffer Locality Aware Cachig Policies for Hybrid Memories, ICCD Kultursay+, Evaluatig STT-RAM as a Eergy-Efficiet Mai Memory Alterative, ISPASS Meza+, A Case for Efficiet Hardware-Software Cooperative Maagemet of Storage ad Memory, WEED Lu+, Loose Orderig Cosistecy for Persistet Memory, ICCD Zhao+, FIRM: Fair ad High-Performace Memory Cotrol for Persistet Memory Systems, MICRO Yoo, Meza+, Efficiet Data Mappig ad Bufferig Techiues for Multi-Level Cell Phase-Chage Memories, TACO Re+, ThyNVM: Eablig Software-Trasparet Crash Cosistecy i Persistet Memory Systems, MICRO Chauha+, NVMove: Helpig Programmers Move to Byte-Based Persistece, INFLOW Li+, Utility-Based Hybrid Memory Maagemet, CLUSTER Yu+, Bashee: Badwidth-Efficiet DRAM Cachig via Software/Hardware Cooperatio, MICRO

45 Promisig Resistive Memory Techologies PCM Iject curret to chage material phase Resistace determied by phase STT-MRAM Iject curret to chage maget polarity Resistace determied by polarity Memristors/RRAM/ReRAM Iject curret to chage atomic structure Resistace determied by atom distace 45

46 What is Phase Chage Memory? Phase chage material (chalcogeide glass) exists i two states: Amorphous: Low optical reflexivity ad high electrical resistivity Crystallie: High optical reflexivity ad low electrical resistivity PCM is resistive memory: High resistace (0), Low resistace (1) PCM cell ca be switched betwee states reliably ad uickly 46

above Tmelt ad ueched Read: detect phase via material resistace amorphous/crystallie Large Curret Small

47 How Does PCM Work? Write: chage phase via curret ijectio SET: sustaied curret to heat cell above Tcryst RESET: cell heated above Tmelt ad ueched Read: detect phase via material resistace amorphous/crystallie Large Curret Small Curret Memory Elemet SET (cryst) Low resistace Access Device RESET (amorph) High resistace W Photo Courtesy: Bipi Rajedra, IBM Slide Courtesy: Moiuddi Qureshi, IBM W 47

48 Opportuity: PCM Advatages Scales better tha DRAM, Flash Reuires curret pulses, which scale liearly with feature size Expected to scale to 9m (2022 [ITRS]) Prototyped at 20m (Raoux+, IBM JRD 2008) Ca be deser tha DRAM Ca store multiple bits per cell due to large resistace rage Prototypes with 2 bits/cell i ISSCC 08, 4 bits/cell by 2012 No-volatile Retai data for >10 years at 85C No refresh eeded, low idle power 48

49 Phase Chage Memory Properties Surveyed prototypes from (ITRS, IEDM, VLSI, ISSCC) Derived PCM parameters for F=90m Lee, Ipek, Mutlu, Burger, Architectig Phase Chage Memory as a Scalable DRAM Alterative, ISCA Lee et al., Phase Chage Techology ad the Future of Mai Memory, IEEE Micro Top Picks

50 50

51 Phase Chage Memory: Pros ad Cos Pros over DRAM Better techology scalig (capacity ad cost) No volatile à Persistet Low idle power (o refresh) Cos Higher latecies: ~4-15x DRAM (especially write) Higher active eergy: ~2-50x DRAM (especially write) Lower edurace (a cell dies after ~10 8 writes) Reliability issues (resistace drift) Challeges i eablig PCM as DRAM replacemet/helper: Mitigate PCM shortcomigs Fid the right way to place PCM i the system 51

52 PCM-based Mai Memory (I) How should PCM-based (mai) memory be orgaized? Hybrid PCM+DRAM [Qureshi+ ISCA 09, Dhima+ DAC 09]: How to partitio/migrate data betwee PCM ad DRAM 52

53 PCM-based Mai Memory (II) How should PCM-based (mai) memory be orgaized? Pure PCM mai memory [Lee et al., ISCA 09, Top Picks 10]: How to redesig etire hierarchy (ad cores) to overcome PCM shortcomigs 53

54 A Iitial Study: Replace DRAM with PCM Lee, Ipek, Mutlu, Burger, Architectig Phase Chage Memory as a Scalable DRAM Alterative, ISCA Surveyed prototypes from (e.g. IEDM, VLSI, ISSCC) Derived average PCM parameters for F=90m 54

55 Results: Naïve Replacemet of DRAM with PCM Replace DRAM with PCM i a 4-core, 4MB L2 system PCM orgaized the same as DRAM: row buffers, baks, peripherals 1.6x delay, 2.2x eergy, 500-hour average lifetime Lee, Ipek, Mutlu, Burger, Architectig Phase Chage Memory as a Scalable DRAM Alterative, ISCA

56 Architectig PCM to Mitigate Shortcomigs Idea 1: Use multiple arrow row buffers i each PCM chip à Reduces array reads/writes à better edurace, latecy, eergy Idea 2: Write ito array at cache block or word graularity à Reduces uecessary wear DRAM PCM 56

57 Results: Architected PCM as Mai Memory 1.2x delay, 1.0x eergy, 5.6-year average lifetime Scalig improves eergy, edurace, desity Caveat 1: Worst-case lifetime is much shorter (o guaratees) Caveat 2: Itesive applicatios see large performace ad eergy hits Caveat 3: Optimistic PCM parameters? 57

58 More o PCM As Mai Memory Bejami C. Lee, Egi Ipek, Our Mutlu, ad Doug Burger, "Architectig Phase Chage Memory as a Scalable DRAM Alterative" Proceedigs of the 36th Iteratioal Symposium o Computer Architecture (ISCA), pages 2-13, Austi, TX, Jue Slides (pdf) 58

59 More o PCM As Mai Memory (II) Bejami C. Lee, Pig Zhou, Ju Yag, Youtao Zhag, Bo Zhao, Egi Ipek, Our Mutlu, ad Doug Burger, "Phase Chage Techology ad the Future of Mai Memory" IEEE Micro, Special Issue: Micro's Top Picks from 2009 Computer Architecture Cofereces (MICRO TOP PICKS), Vol. 30, No. 1, pages 60-70, Jauary/February

60 STT-MRAM as Mai Memory Magetic Tuel Juctio (MTJ) device Referece layer: Fixed magetic orietatio Free layer: Parallel or ati-parallel Magetic orietatio of the free layer determies logical state of device High vs. low resistace Write: Push large curret through MTJ to chage orietatio of free layer Read: Sese curret flow Logical 0 Referece Layer Barrier Free Layer Logical 1 Referece Layer Barrier Free Layer Word Lie MTJ Kultursay et al., Evaluatig STT-RAM as a Eergy- Efficiet Mai Memory Alterative, ISPASS Access Trasistor Bit Lie Sese Lie

61 STT-MRAM: Pros ad Cos Pros over DRAM Better techology scalig (capacity ad cost) No volatile à Persistet Low idle power (o refresh) Cos Higher write latecy Higher write eergy Poor desity (curretly) Reliability? Aother level of freedom Ca trade off o-volatility for lower write latecy/eergy (by reducig the size of the MTJ) 61

62 Architected STT-MRAM as Mai Memory 4-core, 4GB mai memory, multiprogrammed workloads ~6% performace loss, ~60% eergy savigs vs. DRAM Performace vs. DRAM 98% 96% 94% 92% 90% 88% STT-RAM (base) STT-RAM (opt) ACT+PRE WB RB Eergy vs. DRAM 100% 80% 60% 40% 20% 0% Kultursay+, Evaluatig STT-RAM as a Eergy-Efficiet Mai Memory Alterative, ISPASS

63 More o STT-MRAM as Mai Memory Emre Kultursay, Mahmut Kademir, Aad Sivasubramaiam, ad Our Mutlu, "Evaluatig STT-RAM as a Eergy-Efficiet Mai Memory Alterative" Proceedigs of the 2013 IEEE Iteratioal Symposium o Performace Aalysis of Systems ad Software (ISPASS), Austi, TX, April Slides (pptx) (pdf) 63

64 A More Viable Approach: Hybrid Memory Systems CPU DRAM PCM DRAM Ctrl Ctrl Phase Chage Memory (or Tech. X) Fast, durable Small, leaky, volatile, high-cost Large, o-volatile, low-cost Slow, wears out, high active eergy Hardware/software maage data allocatio ad movemet to achieve the best of multiple techologies Meza+, Eablig Efficiet ad Scalable Hybrid Memories, IEEE Comp. Arch. Letters, Yoo+, Row Buffer Locality Aware Cachig Policies for Hybrid Memories, ICCD 2012 Best Paper Award.

65 Challege ad Opportuity Providig the Best of Multiple Metrics with Multiple Memory Techologies 65

66 Challege ad Opportuity Heterogeeous, Cofigurable, Programmable Memory Systems 66

67 Hybrid Memory Systems: Issues Cache vs. Mai Memory Graularity of Data Move/Maage-met: Fie or Coarse Hardware vs. Software vs. HW/SW Cooperative Whe to migrate data? How to desig a scalable ad efficiet large cache? 67

Data Placemet i Hybrid Memory Cores/Caches Chael A Memory Cotrollers IDLE Chael B Memory A (Fast, Small) Memory A is fast, but small Load should be balaced o both

68 Data Placemet i Hybrid Memory Cores/Caches Chael A Memory Cotrollers IDLE Chael B Memory A (Fast, Small) Memory A is fast, but small Load should be balaced o both chaels Memory B (Large, Slow) Page 1 Page 2 Which memory do we place each page i, to maximize system performace? Page migratios have performace ad eergy overhead 68

69 Data Placemet Betwee DRAM ad PCM Idea: Characterize data access patters ad guide data placemet i hybrid memory Streamig accesses: As fast i PCM as i DRAM Radom accesses: Much faster i DRAM Idea: Place radom access data with some reuse i DRAM; streamig data i PCM Yoo+, Row Buffer Locality-Aware Data Placemet i Hybrid Memories, ICCD 2012 Best Paper Award. 69

70 Hybrid vs. All-PCM/DRAM [ICCD 12] Normalized Weighted Speedup GB PCM RBLA-Dy 16GB DRAM 31% 29% Normalized Max. Slowdow % better performace tha all PCM, withi 29% of all DRAM performace Weighted Speedup Max. Slowdow Perf. per Watt 0 Normalized Metric 0 Yoo+, Row Buffer Locality-Aware Data Placemet i Hybrid Memories, ICCD 2012 Best Paper Award. 1

More o Hybrid Memory Data Placemet HaBi Yoo, Justi Meza, Rachata Ausavarugiru, Rachael Hardig, ad Our Mutlu, "Row Buffer Locality Aware Cachig Policies

71 More o Hybrid Memory Data Placemet HaBi Yoo, Justi Meza, Rachata Ausavarugiru, Rachael Hardig, ad Our Mutlu, "Row Buffer Locality Aware Cachig Policies for Hybrid Memories" Proceedigs of the 30th IEEE Iteratioal Coferece o Computer Desig (ICCD), Motreal, Quebec, Caada, September Slides (pptx) (pdf) 71

72 Weakesses of Existig Solutios They are all heuristics that cosider oly a limited part of memory access behavior Do ot directly capture the overall system performace impact of data placemet decisios Example: Noe capture memory-level parallelism (MLP) Number of cocurret memory reuests from the same applicatio whe a page is accessed Affects how much page migratio helps performace 72

73 Importace of Memory-Level Parallelism Before migratio: Before migratio: reuests to Page 1 Mem. B reuests to Page 2 Mem. B reuests to Page 3 Mem. B After migratio: After migratio: reuests to Page 1 Mem. A reuests to Page 2 Mem. A T reuests to Page 3 Mem. Mem. A BT time Migratig oe page reduces stall time by T time Must migrate two pages to reduce stall time by T: migratig oe page aloe does ot help Page migratio decisios eed to cosider MLP 73

74 Our Goal [CLUSTER 2017] A geeralized mechaism that 1. Directly estimates the performace beefit of migratig a page betwee ay two types of memory 2. Places oly the performace-critical data i the fast memory 74

75 Utility-Based Hybrid Memory Maagemet A memory maager that works for ay hybrid memory e.g., DRAM-NVM, DRAM-RLDRAM Key Idea For each page, use comprehesive characteristics to calculate estimated utility (i.e., performace impact) of migratig page from oe memory to the other i the system Migrate oly pages with the highest utility (i.e., pages that improve system performace the most whe migrated) Li+, Utility-Based Hybrid Memory Maagemet, CLUSTER

76 Key Mechaisms of UH-MEM For each page, estimate utility usig a performace model Applicatio stall time reductio How much would migratig a page beefit the performace of the applicatio that the page belogs to? Applicatio performace sesitivity How much does the improvemet of a sigle applicatio s performace icrease the overall system performace? Utility = StallTime 0 Sesitivity 0 Migrate oly pages whose utility exceed the migratio threshold from slow memory to fast memory Periodically adjust migratio threshold 76

77 Results: System Performace ALL FREQ RBLA UH-MEM Normalized Weighted Speedup % 9% 3% 5% 0% 25% 50% 75% 100% Workload Memory Itesity Category UH-MEM improves system performace over the best state-of-the-art hybrid memory maager 77

78 Results: Sesitivity to Slow Memory Latecy We vary t 567 ad t 85 of the slow memory Weighted Speedup t 567 : t 85 : ALL FREQ RBLA UH-MEM 8% 6% 14% 13% 13% x3.0 x3.0 x4.0 x4.0 x4.5 x12 x6.0 x16 x7.5 x20 Slow Memory Latecy Multiplier UH-MEM improves system performace for a wide variety of hybrid memory systems 78

79 More o UH-MEM Yag Li, Saugata Ghose, Jogmoo Choi, Ji Su, Hui Wag, ad Our Mutlu, "Utility-Based Hybrid Memory Maagemet" Proceedigs of the 19th IEEE Cluster Coferece (CLUSTER), Hoolulu, Hawaii, USA, September [Slides (pptx) (pdf)] 79

80 Challege ad Opportuity Eablig a Emergig Techology to Augmet DRAM Maagig Hybrid Memories 80

81 Aother Challege Desigig Effective Large (DRAM) Caches 81

82 Oe Problem with Large DRAM Caches A large DRAM cache reuires a large metadata (tag + block-based iformatio) store How do we desig a efficiet DRAM cache? Metadata: X à DRAM Mem Ctlr CPU LOAD X Mem Ctlr DRAM X (small, fast cache) Access X PCM (high capacity) 82

83 Idea 1: Tags i Memory Store tags i the same row as data i DRAM Store metadata i same row as their data Data ad metadata ca be accessed together DRAM row Cache block 0 Cache block 1 Cache block 2 Tag 0 Tag 1 Tag 2 Beefit: No o-chip tag storage overhead Dowsides: Cache hit determied oly after a DRAM access Cache hit reuires two DRAM accesses 83

84 Idea 2: Cache Tags i SRAM Recall Idea 1: Store all metadata i DRAM To reduce metadata storage overhead Idea 2: Cache i o-chip SRAM freuetly-accessed metadata Cache oly a small amout to keep SRAM size small 84

85 O Large DRAM Cache Desig Justi Meza, Jichua Chag, HaBi Yoo, Our Mutlu, ad Parthasarathy Ragaatha, "Eablig Efficiet ad Scalable Hybrid Memories Usig Fie-Graularity DRAM Cache Maagemet" IEEE Computer Architecture Letters (CAL), February

86 DRAM Caches: May Recet Optios Yu+, Bashee: Badwidth-Efficiet DRAM Cachig via Software/Hardware Cooperatio, MICRO

87 Bashee [MICRO 2017] Tracks presece i cache usig TLB ad Page Table No tag store eeded for DRAM cache Eabled by a ew lightweight lazy TLB coherece protocol New badwidth-aware freuecy-based replacemet policy 87

88 More o Bashee Xiagyao Yu, Christopher J. Hughes, Nadathur Satish, Our Mutlu, ad Sriivas Devadas, "Bashee: Badwidth-Efficiet DRAM Cachig via Software/Hardware Cooperatio" Proceedigs of the 50th Iteratioal Symposium o Microarchitecture (MICRO), Bosto, MA, USA, October

89 Other Opportuities with Emergig Techologies Mergig of memory ad storage e.g., a sigle iterface to maage all data New applicatios e.g., ultra-fast checkpoit ad restore More robust system desig e.g., reducig data loss Processig tightly-coupled with memory e.g., eablig efficiet search ad filterig 89

90 TWO-LEVEL STORAGE MODEL MEMORY CPU STORAGE Ld/St DRAM FILE I/O VOLATILE FAST BYTE ADDR NONVOLATILE SLOW BLOCK ADDR 90

91 TWO-LEVEL STORAGE MODEL MEMORY STORAGE CPU Ld/St DRAM FILE I/O NVM PCM, STT-RAM VOLATILE FAST BYTE ADDR NONVOLATILE SLOW BLOCK ADDR No-volatile memories combie characteristics of memory ad storage 91

Two-Level Memory/Storage Model The traditioal two-level storage

iterface Persistet data i storage à a file system iterface Problem:

buffer data become performace ad eergy bottleecks with fast NVM

fwrite, Operatig system ad file system Address traslatio Processor

92 Two-Level Memory/Storage Model The traditioal two-level storage model is a bottleeck with NVM Volatile data i memory à a load/store iterface Persistet data i storage à a file system iterface Problem: Operatig system (OS) ad file system (FS) code to locate, traslate, buffer data become performace ad eergy bottleecks with fast NVM stores Two-Level Store Virtual memory Load/Store fope, fread, fwrite, Operatig system ad file system Address traslatio Processor ad caches Mai Memory Persistet (e.g., Phase-Chage) Storage Memory (SSD/HDD) 92

93 Uified Memory ad Storage with NVM Goal: Uify memory ad storage maagemet i a sigle uit to elimiate wasted work to locate, trasfer, ad traslate data Improves both eergy ad performace Simplifies programmig model as well Uified Memory/Storage Persistet Memory Maager Processor ad caches Load/Store Feedback Persistet (e.g., Phase-Chage) Memory Meza+, A Case for Efficiet Hardware-Software Cooperative Maagemet of Storage ad Memory, WEED

94 The Persistet Memory Maager (PMM) 1 it mai(void) { 2 // data i file.dat is persistet 3 FILE mydata = "file.dat"; Persistet objects 4 mydata = ew it[64]; 5 } 6 void updatevalue(it, it value) { 7 FILE mydata = "file.dat"; 8 mydata[] = value; // value is persistet 9 } Load Figure 2: Sample program with access to fi Store Hits from SW/OS/rutime Software Hardware Persistet Memory Maager Data Layout, Persistece, Metadata, Security,... DRAM Flash NVM HDD PMM uses access ad hit iformatio to allocate, locate, migrate ad access data i the heterogeeous array of devices 94

95 Performace Beefits of a Sigle-Level Store User CPU User Memory Syscall CPU Syscall I/O 1.0 ~24X Normalized Executio Time ~5X HDD 2-level NVM 2-level Persistet Memory Meza+, A Case for Efficiet Hardware-Software Cooperative Maagemet of Storage ad Memory, WEED

96 Eergy Beefits of a Sigle-Level Store User CPU Syscall CPU DRAM NVM HDD 1.0 ~16X Fractio of Total Eergy ~5X HDD 2-level NVM 2-level Persistet Memory Meza+, A Case for Efficiet Hardware-Software Cooperative Maagemet of Storage ad Memory, WEED

97 O Persistet Memory Beefits & Challeges Justi Meza, Yixi Luo, Samira Kha, Jishe Zhao, Yua Xie, ad Our Mutlu, "A Case for Efficiet Hardware-Software Cooperative Maagemet of Storage ad Memory" Proceedigs of the 5th Workshop o Eergy-Efficiet Desig (WEED), Tel-Aviv, Israel, Jue Slides (pptx) Slides (pdf) 97

98 Challege ad Opportuity Combied Memory & Storage 98

99 Challege ad Opportuity A Uified Iterface to All Data 99

100 Oe Key Challege i Persistet Memory How to esure cosistecy of system/data if all memory is persistet? Two extremes Programmer trasparet: Let the system hadle it Programmer oly: Let the programmer hadle it May alteratives i-betwee 100

101 CHALLENGE: CRASH CONSISTENCY Persistet Memory System System crash ca result i permaet data corruptio i NVM 101

102 CRASH CONSISTENCY PROBLEM Example: Add a ode to a liked list 2. Lik to prev 1. Lik to ext System crash ca result i icosistet memory state 102

103 CURRENT SOLUTIONS Explicit iterfaces to maage cosistecy NV-Heaps [ASPLOS 11], BPFS [SOSP 09], Memosye [ASPLOS 11] AtomicBegi { Isert a ew ode; } AtomicEd; Limits adoptio of NVM Have to rewrite code with clear partitio betwee volatile ad o-volatile data Burde o the programmers 103

104 OUR APPROACH: ThyNVM Goal: Software trasparet cosistecy i persistet memory systems 104

ThyNVM: Summary A ew hardware-based checkpoitig mechaism Checkpoits at multiple graularities to reduce both checkpoitig latecy ad metadata overhead Overlaps

105 ThyNVM: Summary A ew hardware-based checkpoitig mechaism Checkpoits at multiple graularities to reduce both checkpoitig latecy ad metadata overhead Overlaps checkpoitig ad executio to reduce checkpoitig latecy Adapts to DRAM ad NVM characteristics Performs withi 4.9% of a idealized DRAM with zero cost cosistecy 105

106 More About ThyNVM Jiglei Re, Jishe Zhao, Samira Kha, Jogmoo Choi, Yogwei Wu, ad Our Mutlu, "ThyNVM: Eablig Software-Trasparet Crash Cosistecy i Persistet Memory Systems" Proceedigs of the 48th Iteratioal Symposium o Microarchitecture (MICRO), Waikiki, Hawaii, USA, December [Slides (pptx) (pdf)] [Lightig Sessio Slides (pptx) (pdf)] [Poster (pptx) (pdf)] [Source Code] 106

107 Aother Key Challege i Persistet Memory Programmig Ease to Exploit Persistece 107

108 Tools/Libraries to Help Programmers Himashu Chauha, Iria Calciu, Vijay Chidambaram, Eric Schkufza, Our Mutlu, ad Pratap Subrahmayam, "NVMove: Helpig Programmers Move to Byte-Based Persistece" Proceedigs of the 4th Workshop o Iteractios of NVM/Flash with Operatig Systems ad Workloads (INFLOW), Savaah, GA, USA, November [Slides (pptx) (pdf)] 108

109 Ageda Major Treds Affectig Mai Memory The Memory Scalig Problem Two Complemetary Solutio Directios New Memory (DRAM) Architectures Eablig Emergig (NVM) Techologies Coclusio 109

110 The Future of Emergig Techologies is Bright Regardless of challeges i uderlyig techology ad overlyig problems/reuiremets Ca eable: - Orders of magitude improvemets - New applicatios ad computig systems Problem Algorithm Program/Laguage System Software SW/HW Iterface Micro-architecture Logic Devices Electros Yet, we have to - Thik across the stack - Desig eablig systems 110

111 If I Doubt, Refer to Flash Memory A very doubtful emergig techology for at least two decades Proceedigs of the IEEE, Sept

112 Opportuities ad Challeges of Emergig Memory Techologies Our Mutlu September 11, 2017 ARM Research Summit

113 Overview Paper o Flash Reliability Yu Cai, Saugata Ghose, Erich F. Haratsch, Yixi Luo, ad Our Mutlu, "Error Characterizatio, Mitigatio, ad Recovery i Flash Memory Based Solid State Drives" to appear i Proceedigs of the IEEE,

114 Solutio 2: Emergig Memory Techologies Some emergig resistive memory techologies seem more scalable tha DRAM (ad they are o-volatile) Example: Phase Chage Memory Expected to scale to 9m (2022 [ITRS]) Expected to be deser tha DRAM: ca store multiple bits/cell But, emergig techologies have shortcomigs as well Ca they be eabled to replace/augmet/surpass DRAM? Lee+, Architectig Phase Chage Memory as a Scalable DRAM Alterative, ISCA 09, CACM 10, IEEE Micro 10. Meza+, Eablig Efficiet ad Scalable Hybrid Memories, IEEE Comp. Arch. Letters Yoo, Meza+, Row Buffer Locality Aware Cachig Policies for Hybrid Memories, ICCD Kultursay+, Evaluatig STT-RAM as a Eergy-Efficiet Mai Memory Alterative, ISPASS Meza+, A Case for Efficiet Hardware-Software Cooperative Maagemet of Storage ad Memory, WEED Lu+, Loose Orderig Cosistecy for Persistet Memory, ICCD Zhao+, FIRM: Fair ad High-Performace Memory Cotrol for Persistet Memory Systems, MICRO Yoo, Meza+, Efficiet Data Mappig ad Bufferig Techiues for Multi-Level Cell Phase-Chage Memories, TACO Re+, ThyNVM: Eablig Software-Trasparet Crash Cosistecy i Persistet Memory Systems, MICRO Chauha+, NVMove: Helpig Programmers Move to Byte-Based Persistece, INFLOW Li+, Utility-Based Hybrid Memory Maagemet, CLUSTER Yu+, Bashee: Badwidth-Efficiet DRAM Cachig via Software/Hardware Cooperatio, MICRO

115 Combiatio: Hybrid Memory Systems DRAM Fast, durable Small, leaky, volatile, high-cost DRAM Ctrl CPU PCM Ctrl Techology X (e.g., PCM) Large, o-volatile, low-cost Slow, wears out, high active eergy Hardware/software maage data allocatio ad movemet to achieve the best of multiple techologies Meza+, Eablig Efficiet ad Scalable Hybrid Memories, IEEE Comp. Arch. Letters, Yoo, Meza et al., Row Buffer Locality Aware Cachig Policies for Hybrid Memories, ICCD 2012 Best Paper Award.

Exploitig Memory Error Tolerace with Hybrid Memory Systems Memory error vulerability Vulerable Tolerat data data Reliable memory Low-cost memory ECC Vulerable O protected Microsoft s Web Search

116 Exploitig Memory Error Tolerace with Hybrid Memory Systems Memory error vulerability Vulerable Tolerat data data Reliable memory Low-cost memory ECC Vulerable O protected Microsoft s Web Search NoECCworkload Tolerat Parity Reduces Well-tested data server hardware chips cost Less-tested by 4.7 % data chips Achieves sigle server availability target of % App/Data A App/Data B App/Data C Heterogeeous-Reliability Memory [DSN 2014] 116

More o Heterogeeous Reliability Memory Yixi Luo, Sriram Govida, Bikash Sharma, Mark Sataiello, Justi Meza, Ama Kasal, Jie Liu, Badriddie Khessib, Kushagra Vaid, ad Our Mutlu, "Characterizig

117 More o Heterogeeous Reliability Memory Yixi Luo, Sriram Govida, Bikash Sharma, Mark Sataiello, Justi Meza, Ama Kasal, Jie Liu, Badriddie Khessib, Kushagra Vaid, ad Our Mutlu, "Characterizig Applicatio Memory Error Vulerability to Optimize Data Ceter Cost via Heterogeeous-Reliability Memory" Proceedigs of the 44th Aual IEEE/IFIP Iteratioal Coferece o Depedable Systems ad Networks (DSN), Atlata, GA, Jue [Summary] [Slides (pptx) (pdf)] [Coverage o ZDNet] 117

118 Challege ad Opportuity Providig the Best of Multiple Metrics with Multiple Memory Techologies 118

119 Challege ad Opportuity Heterogeeous, Cofigurable, Programmable Memory Systems 119

UH-MEM: Utility-Based Hybrid Memory Management. Yang Li, Saugata Ghose, Jongmoo Choi, Jin Sun, Hui Wang, Onur Mutlu

UH-MEM: Utility-Based Hybrid Memory Maagemet Yag Li, Saugata Ghose, Jogmoo Choi, Ji Su, Hui Wag, Our Mutlu 1 Executive Summary DRAM faces sigificat techology scalig difficulties Emergig memory techologies